Author

Theresa Szczepanski

Research Question

I work for a public charter school in Massachusetts. Our students in Grades 5-10 are tested annually in Mathematics, English Language Arts, and Science. I am interested in using student performance data to identify areas of weakness in our program in terms of curricular alignment. For every question on a given assessment, the state releases an item description detailing what was asked of students, along with the average score earned by students at our school and the average score earned by students across Massachusetts.

My research question is: “Can test item descriptions on MCAS assessments be mined to extract features associated with student performance trends?”

Hypothesis

I have already found statistically significant patterns in student performance associated with an item’s content reporting category in every subject and grade level at our school.

In Mathematics, I have identified specific content reporting categories that are relative weaknesses at different grade levels; however, this does not take into account differences between questions like those Urrutia and Araya classified in open-ended Math prompts. I would be curious to see whether our students are weaker on items that ask them to evaluate an expression (apply technical skills in algebra/numeracy) versus items that ask them to construct or interpret a mathematical model (conceptual understanding). This would be very interesting information for teachers.

When I interviewed one of our experienced teachers who has had historical success with student achievement in English Language Arts (ELA), she identified specific skills she believes all students need to practice for success in ELA, namely, “synthesizing multiple texts” and “practicing the differences between reading informational text and literature.” These requirements can be mined from item descriptions but not from an item’s reporting category or standard description. Our students have historically performed worse on the 7th grade ELA exam than on the 6th and 8th grade exams, which suggests a curricular alignment issue. I have already identified reporting categories in which our students performed relatively poorly on this assessment, and I suspect that within these categories there are patterns in the types of questions or tasks our students struggle with. This could provide valuable information for teachers as they adjust their instruction and instructional materials.

Hypotheses:

H1: A predictive model of student performance on Grade 5-10 Mathematics MCAS assessment items that includes regressors taken from the test item descriptions will outperform a baseline predictive model that includes only a given test item’s content reporting category.

H2: A predictive model of student performance on Grade 5-10 English Language Arts MCAS assessment items that includes regressors taken from the test item descriptions will outperform a baseline predictive model that includes only a given test item’s content reporting category.

H3: A predictive model of student performance on Grade 5-8 Science MCAS assessment items that includes regressors taken from the test item descriptions will outperform a baseline predictive model that includes only a given test item’s content reporting category.

Data Sources

I scraped the Department of Elementary and Secondary Education’s accountability page. It includes tables for all Next Generation MCAS tests (beginning in 2018) and text descriptions for all test items. I already did this for the High School Science exams in my Python course last fall, and the structures of the tables are similar for the other exams.

Here is my folder of Colab notebooks with my Python code for scraping the different grade levels and subjects. Here is a link to the notebook I used for the High School Science MCAS.

  • Math Corpus: 1010 Documents (item descriptions)

  • STE Corpus: 510 Documents, MS Only: 398 Documents

  • ELA Corpus: 693 Documents

In their paper, Urrutia and Araya used classification to categorize open-ended mathematics questions into different types. I would like to use dictionaries to classify MCAS questions into different categories based on their item descriptions.

Then I would like to use supervised machine learning methods that include the features created from the dictionaries, as well as features extracted from the MCAS reporting tables, to predict student performance relative to peers across the state (school_state_diff) on a given item.

I would like to use this same approach for the English Language Arts and Science exams as well.
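As a rough sketch of the modeling comparison behind H1 (the data frame `model_df`, the column names, and the dictionary flags `modeling` and `patterns_algebra` are illustrative assumptions, not final choices), a baseline model using only the reporting category could be compared against an augmented model with dictionary-derived features using caret:

```r
library(caret)

# Hypothetical data frame `model_df`: one row per test item, with the
# state-provided reporting category plus dictionary-derived flags
# (`modeling`, `patterns_algebra`) from the tagging step.
set.seed(1)

ctrl <- trainControl(method = "cv", number = 5)

baseline_fit <- train(school_state_diff ~ reporting_category,
                      data = model_df, method = "rf", trControl = ctrl)

augmented_fit <- train(school_state_diff ~ reporting_category +
                         modeling + patterns_algebra,
                       data = model_df, method = "rf", trControl = ctrl)

# Compare cross-validated RMSE of the two models
summary(resamples(list(baseline = baseline_fit, augmented = augmented_fit)))
```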

Load Libraries

Code
library(caret)
Loading required package: ggplot2
Loading required package: lattice
Code
library(devtools)
Loading required package: usethis
Code
library(e1071)
library(ggplot2)
library(ggpubr)
library(hrbrthemes)
library(Metrics)

Attaching package: 'Metrics'
The following objects are masked from 'package:caret':

    precision, recall
Code
library(plotly)

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout
Code
library(plyr)

Attaching package: 'plyr'
The following objects are masked from 'package:plotly':

    arrange, mutate, rename, summarise
The following object is masked from 'package:ggpubr':

    mutate
Code
library(purrr)

Attaching package: 'purrr'
The following object is masked from 'package:plyr':

    compact
The following object is masked from 'package:caret':

    lift
Code
library(quanteda)
Package version: 4.1.0
Unicode version: 14.0
ICU version: 71.1
Parallel computing: disabled
See https://quanteda.io for tutorials and examples.
Code
library(quanteda.textstats)
library(quanteda.textplots)
library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'
The following object is masked from 'package:ggplot2':

    margin
Code
library(RColorBrewer)
library(tidytext)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ stringr   1.5.1
✔ forcats   1.0.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ readr     2.1.5     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::arrange()       masks plyr::arrange(), plotly::arrange()
✖ dplyr::combine()       masks randomForest::combine()
✖ purrr::compact()       masks plyr::compact()
✖ dplyr::count()         masks plyr::count()
✖ dplyr::desc()          masks plyr::desc()
✖ dplyr::failwith()      masks plyr::failwith()
✖ dplyr::filter()        masks plotly::filter(), stats::filter()
✖ dplyr::id()            masks plyr::id()
✖ dplyr::lag()           masks stats::lag()
✖ purrr::lift()          masks caret::lift()
✖ randomForest::margin() masks ggplot2::margin()
✖ dplyr::mutate()        masks plyr::mutate(), plotly::mutate(), ggpubr::mutate()
✖ dplyr::rename()        masks plyr::rename(), plotly::rename()
✖ dplyr::summarise()     masks plyr::summarise(), plotly::summarise()
✖ dplyr::summarize()     masks plyr::summarize()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(tree)
library(viridis)
Loading required package: viridisLite

Read-in/Tidy Data

Most of the cleaning was done in the process of scraping the tables; that code is included in my folder of Python Colab notebooks.

STE Data Frames

Code
STE_DF<-read_csv("STE_all_years_2024.csv")%>%
  select(`item description`, year, number, type, `reporting category`, `discipline core idea`, `standard`, `standard desc`, `pts`, `school%`, `state%`, school_state_diff, grade_level, item_library_URL)%>%
  mutate(number = as.factor(number))%>%
  mutate(grade_level = as.factor(grade_level))
Rows: 521 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): type, reporting category, standard, item description, discipline co...
dbl (7): year, number, pts, school%, state%, school_state_diff, grade_level

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
STE_DF

ELA Data Frames

Code
ELA_G5_DF<-read_csv("G5_ELA_all_years_2024.csv")%>%
  select(`item description`, year, number, type, `reporting category`, `Cluster`, `text_type`, `standard`,  `pts`, `school%`, `state%`, school_state_diff, item_library_URL)%>%
  mutate(number = as.factor(number))%>%
  mutate(grade_level = 5)%>%
  mutate(grade_level = as.factor(grade_level))
New names:
• `` -> `...1`
Rows: 171 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): number, item_library_URL, standard_map_URL, type, reporting categor...
dbl (6): ...1, year, pts, school%, state%, school_state_diff
lgl (1): released
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
#ELA_G5_DF


ELA_G6_DF<-read_csv("G6_ELA_all_years_2024.csv")%>%
  select(`item description`, year, number, type, `reporting category`, `Cluster`, `text_type`, `standard`,  `pts`, `school%`, `state%`, school_state_diff, item_library_URL)%>%
  mutate(number = as.factor(number))%>%
  mutate(grade_level = 6)%>%
  mutate(grade_level = as.factor(grade_level))
New names:
• `` -> `...1`
Rows: 171 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): number, item_library_URL, standard_map_URL, type, reporting categor...
dbl (6): ...1, year, pts, school%, state%, school_state_diff
lgl (1): released
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
#ELA_G6_DF

ELA_G7_DF<-read_csv("G7_ELA_all_years_2024.csv")%>%
  select(`item description`, year, number, type, `reporting category`, `Cluster`, `text_type`, `standard`,  `pts`, `school%`, `state%`, school_state_diff, item_library_URL)%>%
  mutate(number = as.factor(number))%>%
  mutate(grade_level = 7)%>%
  mutate(grade_level = as.factor(grade_level))
New names:
• `` -> `...1`
Rows: 172 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): number, item_library_URL, standard_map_URL, type, reporting categor...
dbl (7): ...1, year, pts, school%, state%, school_state_diff, Unnamed: 10
lgl (8): released, Unnamed: 3, Unnamed: 4, Unnamed: 5, Unnamed: 6, Unnamed: ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
#ELA_G7_DF

ELA_G8_DF<-read_csv("G8_ELA_all_years_2024.csv")%>%
  select(`item description`, year, number, type, `reporting category`, `Cluster`, `text_type`, `standard`,  `pts`, `school%`, `state%`, school_state_diff, item_library_URL)%>%
  mutate(number = as.factor(number))%>%
  mutate(grade_level = 8)%>%
  mutate(grade_level = as.factor(grade_level))
New names:
• `` -> `...1`
Rows: 171 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): number, item_library_URL, standard_map_URL, type, reporting categor...
dbl (6): ...1, year, pts, school%, state%, school_state_diff
lgl (1): released
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
#ELA_G8_DF


ELA_G10_DF<-read_csv("G10_ELA_all_years_2024.csv")%>%
  select(`item description`, year, number, type, `reporting category`, `Cluster`, `text_type`, `standard`,  `pts`, `school%`, `state%`, school_state_diff, item_library_URL)%>%
  mutate(number = as.factor(number))%>%
  mutate(grade_level = 10)%>%
  mutate(grade_level = as.factor(grade_level))
New names:
• `` -> `...1`
Rows: 173 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): number, item_library_URL, standard_map_URL, type, reporting categor...
dbl (6): ...1, year, pts, school%, state%, school_state_diff
lgl (1): released
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
#ELA_G10_DF

ELA_DF<-bind_rows(ELA_G5_DF, ELA_G6_DF, ELA_G7_DF, ELA_G8_DF, ELA_G10_DF)%>%
  filter(`type` == "SR")

ELA_DF

Math Data Frames

Code
Math_G5_DF<-read_csv("G5_Math_all_years_2024.csv")%>%
  select(`item description`, year, number, type, `reporting category`, `standard`, `Standard Description`,  `pts`, `school%`, `state%`, school_state_diff, item_library_URL)%>%
  mutate(number = as.factor(number))%>%
  mutate(grade_level = 5)%>%
  mutate(grade_level = as.factor(grade_level))
New names:
• `` -> `...1`
Rows: 200 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): item_library_URL, standard_map_URL, type, reporting category, stan...
dbl (7): ...1, year, number, pts, school%, state%, school_state_diff
lgl (1): released
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
#Math_G5_DF


Math_G6_DF<-read_csv("G6_Math_all_years_2024.csv")%>%
  select(`item description`, year, number, type, `reporting category`, `standard`, `Standard Description`,  `pts`, `school%`, `state%`, school_state_diff, item_library_URL)%>%
  mutate(number = as.factor(number))%>%
  mutate(grade_level = 6)%>%
  mutate(grade_level = as.factor(grade_level))
New names:
• `` -> `...1`
Rows: 200 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): item_library_URL, standard_map_URL, type, reporting category, stand...
dbl (7): ...1, year, number, pts, school%, state%, school_state_diff
lgl (1): released
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
#Math_G6_DF

Math_G7_DF<-read_csv("G7_Math_all_years_2024.csv")%>%
  select(`item description`, year, number, type, `reporting category`, `standard`, `Standard Description`,  `pts`, `school%`, `state%`, school_state_diff, item_library_URL)%>%
  mutate(number = as.factor(number))%>%
  mutate(grade_level = 7)%>%
  mutate(grade_level = as.factor(grade_level))
New names:
• `` -> `...1`
Rows: 200 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): item_library_URL, standard_map_URL, type, reporting category, stand...
dbl (7): ...1, year, number, pts, school%, state%, school_state_diff
lgl (1): released
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
#Math_G7_DF

Math_G8_DF<-read_csv("G8_Math_all_years_2024.csv")%>%
  select(`item description`, year, number, type, `reporting category`, `standard`, `Standard Description`,  `pts`, `school%`, `state%`, school_state_diff, item_library_URL)%>%
  mutate(number = as.factor(number))%>%
  mutate(grade_level = 8)%>%
  mutate(grade_level = as.factor(grade_level))
New names:
• `` -> `...1`
Rows: 200 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): item_library_URL, standard_map_URL, type, reporting category, stand...
dbl (7): ...1, year, number, pts, school%, state%, school_state_diff
lgl (1): released
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
#Math_G8_DF


Math_G10_DF<-read_csv("G10_Math_all_years_2024.csv")%>%
  select(`item description`, year, number, type, `reporting category`, `standard`, `Standard Description`,  `pts`, `school%`, `state%`, school_state_diff, item_library_URL)%>%
  mutate(number = as.factor(number))%>%
  mutate(grade_level = 10)%>%
  mutate(grade_level = as.factor(grade_level))
New names:
• `` -> `...1`
Rows: 210 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): item_library_URL, standard_map_URL, type, reporting category, stand...
dbl (7): ...1, year, number, pts, school%, state%, school_state_diff
lgl (1): released
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
#Math_G10_DF

Math_DF<-bind_rows(Math_G5_DF, Math_G6_DF, Math_G7_DF, Math_G8_DF, Math_G10_DF)

Math_DF

PreProcessing Approach

For each exam I followed a similar approach:

  1. Create a dataframe of all of the item descriptions and features across all grade levels for each subject. I kept as annotations all of the features provided by the state, which include: year, standard, reporting category, cluster (sub-category), pts, state%, school%, school_state_diff, and type (open response or selected response).

  2. Using preText, I ran factorial_preprocessing, looked for the specification with the lowest preText score, and selected tokens accordingly. Interestingly, the recommendation for the Math and Science corpora was to not remove stop words. I believe this is because of the key logical role that words like “and”, “but”, “not”, and “or” play. These stop words do dominate the item descriptions, though, so I also created a second set of tokens with stop words removed, which I used for text visualizations and descriptive analysis of the text. When applying dictionaries and using machine learning, I will use the texts that include stop words. Also, whenever stemming was recommended, I used lemmatization instead.

  3. Create subset document-feature matrices for the items that students performed best on and for the items that students performed worst on.

  4. Create visuals of the word connections for all item descriptions, as well as for the descriptions in the best and worst subsets, to explore differences.

  5. Using n-grams, identify key phrases from the item descriptions.

  6. Create a dictionary that uses the key phrases identified in steps 4 and 5 to tag an item as including any of the following:

    Math:

    • modeling: A modeling problem

    • apply_division: Problem requiring the use of division and division facts

    • patterns_algebra: Problem requiring the manipulation of equivalent algebraic or numeric expressions

    ELA:

    • poetry: Students are asked questions where the source text is a poem

    • figurative_language: Questions requiring the analysis of figurative language

    Science:

    • analyze graph: Students are asked questions involving the analysis of a graph

    • analyze model: Students are asked questions involving the analysis of a model

  7. Update each Subject_DF to include the new features identified via the subject-area dictionary for each document, to be used as features in a supervised machine learning model that predicts school_state_diff.
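Steps 6 and 7 can be sketched with quanteda's `dictionary()` and `tokens_lookup()`; the key phrases below are illustrative placeholders, not the final dictionary entries, which come from the n-gram analysis:

```r
library(quanteda)

# Illustrative Math dictionary; the real key phrases come from steps 4 and 5
math_dict <- dictionary(list(
  modeling         = c("model*", "real-world"),
  apply_division   = c("divide*", "division", "quotient"),
  patterns_algebra = c("equivalent expression*", "evaluate the expression")
))

# Count dictionary hits per item description (tokens_lookup matches
# multi-word phrases; the tokens keep stop words, as decided above)
dict_dfm <- dfm(tokens_lookup(Math_item_tokens, dictionary = math_dict))

# Append the tag counts to Math_DF as features for the supervised models
Math_DF <- cbind(Math_DF, convert(dict_dfm, to = "data.frame")[, -1])
```

This assumes tokens like `Math_item_tokens`, created in the Math Exam section.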

Math Exam

Create Corpus

Code
Math_item_corpus <-corpus(Math_DF, text_field = "item description")
Warning: NA is replaced by empty string
Code
#print(Math_item)

#(summary(Math_item_corpus))

Math_DF

preText Pre-Processing Decisions

Before completing the pre-processing, I examined the different choices using preText. I was surprised to see that removing stop words had a positive regression coefficient.

I can see how key logical words like “not”, “and”, and “or”, which are also stop words, can have a significant impact on the meaning of an exam question. Perhaps, because each individual text is so small and the texts are designed for assessing content skills rather than narrative, the stop words play more significant roles.

Given these results, I will pre-process the data two ways: once using the specification with the lowest preText score, N, plus lowercasing (which has a regression coefficient of 0, and I can’t see how it would affect anything), and once using a pipeline that removes stop words, to use for text visualization and simple descriptive summaries of the text:

  • Recommended preText specification: “N-L” (remove numbers and lowercase)

  • Alternative approach: “P-N-L-W” (remove punctuation, remove numbers, lowercase, and remove stop words)

Code
# Sys.unsetenv("GITHUB_PAT")
# devtools::install_github("matthewjdenny/preText")
# library(preText)
Code
# preprocessed_documents_math <- factorial_preprocessing(
#     Math_item_corpus,
#     use_ngrams = TRUE,
#     infrequent_term_threshold = 0.2,
#     verbose = FALSE)
Code
# #names(preprocessed_documents_math)
# 
# head(preprocessed_documents_math$choices)
Code
# 
# preText_results <- preText(
#     preprocessed_documents_math,
#     dataset_name = "Math MCAS Item Descriptions",
#     distance_method = "cosine",
#     num_comparisons = 20,
#     verbose = FALSE)
# 
# 
# preText_score_plot(preText_results)
Code
# regression_coefficient_plot(preText_results,
#                             remove_intercept = TRUE)

Tokenization 1: N-L

Code
## Extract the tokens

Math_item_tokens <- tokens(Math_item_corpus)

#print(Math_item_tokens)
Code
Math_item_tokens1 <- tokens(Math_item_corpus, 
    remove_numbers = T)


Math_item_tokens1 <- tokens_tolower(Math_item_tokens1)

# Math_item_tokens1 <- tokens_select(Math_item_tokens1,
#                    pattern = stopwords("en"),
#                   selection = "remove")

print(Math_item_tokens1)
Tokens consisting of 1,010 documents and 12 docvars.
text1 :
 [1] "determine"   "the"         "coordinates" "of"          "a"          
 [6] "point"       "in"          "the"         "first"       "quadrant"   
[11] "that"        "will"       
[ ... and 14 more ]

text2 :
 [1] "match"       "numerical"   "expressions" "that"        "involve"    
 [6] "two"         "operations"  "with"        "equivalent"  "word"       
[11] "expressions" "."          

text3 :
 [1] "solve"       "a"           "real-world"  "problem"     "that"       
 [6] "involves"    "multiplying" "a"           "fraction"    "by"         
[11] "a"           "mixed"      
[ ... and 2 more ]

text4 :
 [1] "determine"   "the"         "volume"      "of"          "right"      
 [6] "rectangular" "prisms"      "by"          "counting"    "unit"       
[11] "cubes"       "and"        
[ ... and 6 more ]

text5 :
[1] "round"     "decimals"  "to"        "the"       "nearest"   "hundredth"
[7] "."        

text6 :
 [1] "estimate"  "the"       "sum"       "of"        "two"       "fractions"
 [7] "that"      "are"       "both"      "less"      "than"      "one"      
[ ... and 1 more ]

[ reached max_ndoc ... 1,004 more documents ]

Tokenization 2: P-N-L-W & Lemmatization

Code
# remove punctuation and numbers
Math_item_tokens2 <- tokens(Math_item_corpus, 
    remove_punct = T,
    remove_numbers = T)

# remove stopwords

Math_item_tokens2 <- tokens_select(Math_item_tokens2,
                   pattern = stopwords("en"),
                  selection = "remove")

# lower case

Math_item_tokens2 <- tokens_tolower(Math_item_tokens2)




print(Math_item_tokens2)
Tokens consisting of 1,010 documents and 12 docvars.
text1 :
 [1] "determine"   "coordinates" "point"       "first"       "quadrant"   
 [6] "create"      "rectangle"   "first"       "three"       "points"     
[11] "rectangle"   "given"      

text2 :
[1] "match"       "numerical"   "expressions" "involve"     "two"        
[6] "operations"  "equivalent"  "word"        "expressions"

text3 :
[1] "solve"       "real-world"  "problem"     "involves"    "multiplying"
[6] "fraction"    "mixed"       "number"     

text4 :
 [1] "determine"   "volume"      "right"       "rectangular" "prisms"     
 [6] "counting"    "unit"        "cubes"       "using"       "volume"     
[11] "solve"       "problems"   

text5 :
[1] "round"     "decimals"  "nearest"   "hundredth"

text6 :
[1] "estimate"  "sum"       "two"       "fractions" "less"      "one"      

[ reached max_ndoc ... 1,004 more documents ]

Lemmatization

Code
lem_Math_item_tokens2<-tokens_replace(Math_item_tokens2,
                             pattern = lexicon:: hash_lemmas$token,
                             replacement = lexicon:: hash_lemmas$lemma)

lem_Math_item_tokens2
Tokens consisting of 1,010 documents and 12 docvars.
text1 :
 [1] "determine"  "coordinate" "point"      "first"      "quadrant"  
 [6] "create"     "rectangle"  "first"      "three"      "point"     
[11] "rectangle"  "give"      

text2 :
[1] "match"      "numerical"  "expression" "involve"    "two"       
[6] "operation"  "equivalent" "word"       "expression"

text3 :
[1] "solve"      "real-world" "problem"    "involve"    "multiply"  
[6] "fraction"   "mix"        "numb"      

text4 :
 [1] "determine"   "volume"      "right"       "rectangular" "prism"      
 [6] "count"       "unit"        "cube"        "use"         "volume"     
[11] "solve"       "problem"    

text5 :
[1] "round"   "decimal" "near"    "100"    

text6 :
[1] "estimate" "sum"      "two"      "fraction" "little"   "one"     

[ reached max_ndoc ... 1,004 more documents ]

Create DFM

Code
df_Math_toks1<-dfm(Math_item_tokens1)


df_Math_toks1_smaller<- dfm_trim(df_Math_toks1, min_docfreq = 0.08, docfreq_type = "prop")


df_Math_toks1
Document-feature matrix of: 1,010 documents, 853 features (98.41% sparse) and 12 docvars.
       features
docs    determine the coordinates of a point in first quadrant that
  text1         1   4           1  2 2     1  1     2        1    1
  text2         0   0           0  0 0     0  0     0        0    1
  text3         0   0           0  0 3     0  0     0        0    1
  text4         1   1           0  1 0     0  0     0        0    0
  text5         0   1           0  0 0     0  0     0        0    0
  text6         0   1           0  1 0     0  0     0        0    1
[ reached max_ndoc ... 1,004 more documents, reached max_nfeat ... 843 more features ]
Code
df_Math_toks1_smaller
Document-feature matrix of: 1,010 documents, 27 features (73.95% sparse) and 12 docvars.
       features
docs    determine the of a in that given . two with
  text1         1   4  2 2  1    1     1 1   0    0
  text2         0   0  0 0  0    1     0 1   1    1
  text3         0   0  0 3  0    1     0 1   0    0
  text4         1   1  1 0  0    0     0 1   0    0
  text5         0   1  0 0  0    0     0 1   0    0
  text6         0   1  1 0  0    1     0 1   1    0
[ reached max_ndoc ... 1,004 more documents, reached max_nfeat ... 17 more features ]
Code
df_Math_toks2<-dfm(lem_Math_item_tokens2)


df_Math_toks_smaller<- dfm_trim(df_Math_toks2, min_docfreq = 0.08, docfreq_type = "prop")


df_Math_toks2
Document-feature matrix of: 1,010 documents, 577 features (98.64% sparse) and 12 docvars.
       features
docs    determine coordinate point first quadrant create rectangle three give
  text1         1          1     2     2        1      1         2     1    1
  text2         0          0     0     0        0      0         0     0    0
  text3         0          0     0     0        0      0         0     0    0
  text4         1          0     0     0        0      0         0     0    0
  text5         0          0     0     0        0      0         0     0    0
  text6         0          0     0     0        0      0         0     0    0
       features
docs    match
  text1     0
  text2     1
  text3     0
  text4     0
  text5     0
  text6     0
[ reached max_ndoc ... 1,004 more documents, reached max_nfeat ... 567 more features ]
Code
df_Math_toks_smaller
Document-feature matrix of: 1,010 documents, 17 features (83.40% sparse) and 12 docvars.
       features
docs    determine give expression two solve real-world problem fraction numb
  text1         1    1          0   0     0          0       0        0    0
  text2         0    0          2   1     0          0       0        0    0
  text3         0    0          0   0     1          1       1        1    1
  text4         1    0          0   0     1          0       1        0    0
  text5         0    0          0   0     0          0       0        0    0
  text6         0    0          0   1     0          0       0        1    0
       features
docs    use
  text1   0
  text2   0
  text3   0
  text4   1
  text5   0
  text6   0
[ reached max_ndoc ... 1,004 more documents, reached max_nfeat ... 7 more features ]

n-grams

Code
 Sys.unsetenv("GITHUB_PAT")
devtools::install_github("slanglab/phrasemachine/R/phrasemachine")
Using GitHub PAT from the git credential store.
Skipping install of 'phrasemachine' from a github remote, the SHA1 (41cec3b7) has not changed since last install.
  Use `force = TRUE` to force installation
Code
#   
library(phrasemachine) 
phrasemachine: Simple Phrase Extraction
Version 1.2.0 created on 2017-05-29.
copyright (c) 2016, Matthew J. Denny, Abram Handler, Brendan O'Connor.
Type help('phrasemachine') or
vignette('getting_started_with_phrasemachine') to get started.
Development website: https://github.com/slanglab/phrasemachine
Code
documents_weak_items<-as.character(corpus_subset(Math_item_corpus, school_state_diff < -2))[26:29]

#documents_weak_items

phrases<-phrasemachine(documents_weak_items, minimum_ngram_length = 2,
                         maximum_ngram_length = 4,
                         return_phrase_vectors = TRUE,
                         return_tag_sequences = TRUE)
Currently tagging document 1 of 4 
Currently tagging document 2 of 4 
Currently tagging document 3 of 4 
Currently tagging document 4 of 4 
Extracting phrases from document 1 of 4 
Extracting phrases from document 2 of 4 
Extracting phrases from document 3 of 4 
Extracting phrases from document 4 of 4 
Code
# look at some example phrases
print(phrases[[1]]$phrases[1:20])
 [1] "Place_parentheses"            "parentheses_in_an_expression"
 [3] "expression_equivalent"        NA                            
 [5] NA                             NA                            
 [7] NA                             NA                            
 [9] NA                             NA                            
[11] NA                             NA                            
[13] NA                             NA                            
[15] NA                             NA                            
[17] NA                             NA                            
[19] NA                             NA                            
Code
documents_weak_items
                                                                                                                                                                                text84 
                                                                                              "Place parentheses in an expression to make the expression equivalent to a given value." 
                                                                                                                                                                                text85 
                                      "Determine the number of cubes that are used to create a figure when the length of one edge of the cube and the volume of the figure are given." 
                                                                                                                                                                                text93 
"Graph a given ordered pair on a coordinate plane, give the ordered pair of a point on a coordinate plane, and interpret coordinate values of points in the context of the situation." 
                                                                                                                                                                                text95 
                                                                                                                                 "Identify equivalent numerical and word expressions." 

Subset: Worst performing Items

Code
Math_worst_item_corpus<- corpus_subset(Math_item_corpus, school_state_diff < 2 )

# remove punctuation and numbers
Math_worst_tokens <- tokens(Math_worst_item_corpus, 
    remove_punct = T,
    remove_numbers = T)

Math_worst_tokens <- tokens_tolower(Math_worst_tokens)


Math_worst_tokens <-  tokens_select(Math_worst_tokens,
                  pattern = stopwords("en"),
                 selection = "remove")

lem_Math_worst_tokens <- tokens_replace(Math_worst_tokens,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

df_Math_worst_tokens<-dfm(lem_Math_worst_tokens)
df_Math_worst_toks_smaller<- dfm_trim(df_Math_worst_tokens, min_docfreq = 0.08, docfreq_type = "prop")

Subset: Best performing Items

Code
Math_best_item_corpus<- corpus_subset(Math_item_corpus, school_state_diff > 5 )

# remove punctuation and numbers
Math_best_tokens <- tokens(Math_best_item_corpus, 
    remove_punct = T,
    remove_numbers = T)

Math_best_tokens <- tokens_tolower(Math_best_tokens)


Math_best_tokens <-  tokens_select(Math_best_tokens,
                  pattern = stopwords("en"),
                 selection = "remove")

lem_Math_best_tokens <- tokens_replace(Math_best_tokens,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

df_Math_best_tokens<-dfm(lem_Math_best_tokens)
df_Math_best_toks_smaller<- dfm_trim(df_Math_best_tokens, min_docfreq = 0.08, docfreq_type = "prop")
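The same tokenize / lowercase / remove-stopwords / lemmatize / trim pipeline is repeated for each subset above. As a sketch (assuming quanteda, its stopwords, and the lexicon package are loaded as in the preceding blocks), a small helper would reduce the duplication:

```r
# helper: corpus -> trimmed, lemmatized dfm (same steps as the blocks above)
prep_dfm <- function(corp, min_docfreq = 0.08) {
  toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
  toks <- tokens_tolower(toks)
  toks <- tokens_select(toks, pattern = stopwords("en"), selection = "remove")
  toks <- tokens_replace(toks,
                         pattern = lexicon::hash_lemmas$token,
                         replacement = lexicon::hash_lemmas$lemma)
  dfm_trim(dfm(toks), min_docfreq = min_docfreq, docfreq_type = "prop")
}

# e.g. df_Math_worst_toks_smaller <- prep_dfm(Math_worst_item_corpus)
```

The `min_docfreq` argument lets the same helper serve the STE subsets below, which use different trim thresholds.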

Descriptive Analysis

Interestingly, words like “determine”, “real-world”, and “expression” are top features of the Math tokens as a whole. In the tokens taken only from the worst-performing items, however, the word “expression” appears much more frequently than “real-world”, while the opposite holds in the best-performing items and in the corpus overall. Also, the word “graph” is a top feature in the full Math corpus and in the best-performing items, but not in the worst-performing items.
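To check these impressions numerically rather than by scanning raw counts, each term's share of its subcorpus can be compared directly. A minimal sketch, assuming the three dfm objects defined above:

```r
# proportional frequency of selected terms within a dfm
# (a term absent from the dfm would return NA)
prop_freq <- function(d, terms) {
  tf <- colSums(d)
  round(tf[terms] / sum(tf), 4)
}

terms <- c("expression", "real-world", "graph")
rbind(all   = prop_freq(df_Math_toks2, terms),
      worst = prop_freq(df_Math_worst_toks_smaller, terms),
      best  = prop_freq(df_Math_best_toks_smaller, terms))
```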

All Items: Top Features

Code
topfeatures(df_Math_toks2, 20)
 determine       give real-world        use expression    problem      solve 
       409        330        278        225        212        206        195 
  equation       numb  represent   identify      value        two    context 
       150        146        143        120        114        113        108 
     graph     number coordinate   fraction       line    involve 
       103        100         94         94         92         77 
Code
#topfeatures(df_Math_toks1, 20)

Worst performing Items: Top Features

Code
topfeatures(df_Math_worst_toks_smaller, 20)
 determine       give expression real-world        use       numb    problem 
       163        134        115         82         82         72         63 
     solve  represent   identify   equation        two      value equivalent 
        56         53         50         49         48         46         42 
    linear 
        36 

Best performing Items: Top Features

Code
topfeatures(df_Math_best_toks_smaller, 20)
   determine         give   real-world          use        solve      problem 
         173          133          133           94           89           89 
    equation   expression    represent      context        value         numb 
          72           70           63           56           54           51 
       graph          two         line     identify relationship   coordinate 
          51           46           44           43           43           42 
    fraction      involve 
          38           36 

Visualizations

I found it particularly interesting to compare the connections in the worst-performing subset with those in the best-performing subset. There appears to be a theme of identifying equivalent expressions that shows up only in the worst subset. Solving “real-world” problems, i.e., modeling, appears strongly in both the best and worst subsets. This makes me curious whether there are certain mathematical topics where our students struggle with the modeling problems.

Because of these patterns, I will create a dictionary to classify whether an item requires modeling, apply_division, or patterns_algebra.

Word Cloud: All Questions

Code
#dfm_item
smaller_dfm <-dfm_trim(df_Math_toks2, min_termfreq = 10)
smaller_dfm <- dfm_trim(smaller_dfm, min_docfreq = 0.02, docfreq_type = "prop")

textplot_wordcloud(smaller_dfm)

All questions

Clearly, mathematical modeling is a significant skill throughout the Math exam, as the network “solve real-world problem” and the connection “real-world context” are very strong.

The word “expression”

Code
### Word connections for all items

dim(df_Math_toks_smaller)
[1] 1010   17
Code
# compute size weight for vertices in network
size <- log(colSums(df_Math_toks_smaller))

# create plot
textplot_network(df_Math_toks_smaller, vertex_size = size/ max(size) * 3)

Worst-performing questions Network Plot

The connection “equivalent expression” and the connections to the word “expression” are much stronger in the network of students’ weakest-performing items. Noticeably, the words “context” and “represent” are not strong connectors to the word “expression” here, though they are in the best-items network. This signals to me that the algebraic manipulation of expressions is a weakness, rather than constructing an expression in a mathematical modeling context.

The network “solve real-world problem” is also strong in our weakest performing item descriptions but not as heavily emphasized as in our strongest performing items.
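A keyness test offers a more formal version of this comparison, scoring which terms are over-represented in the weakest items relative to the rest of the corpus. A sketch, assuming the quanteda.textstats package is available and that df_Math_toks2 retains the school_state_diff docvar:

```r
library(quanteda.textstats)

# target = TRUE for the "worst" items (same cutoff used for the subset above)
worst_flag <- docvars(df_Math_toks2, "school_state_diff") < 2
key_terms  <- textstat_keyness(df_Math_toks2, target = worst_flag)

head(key_terms, 10)  # terms most characteristic of the weakest items
```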

Code
### Word connections for worst items

dim(df_Math_worst_toks_smaller)
[1] 401  15
Code
# compute size weight for vertices in network
size <- log(colSums(df_Math_worst_toks_smaller))

# create plot
textplot_network(df_Math_worst_toks_smaller, vertex_size = size/ max(size) * 3)

Best-performing questions

We can see that the thread “solve real-world problems” is strongly present in the descriptions of our students’ best-performing items.

Also, I can see the connection “determine expression” present as well as “represent” and “context”. These strike me as words connected with constructing models and identifying mathematical concepts in word problems.

Code
### Word connections for best items

dim(df_Math_best_toks_smaller)
[1] 422  20
Code
# compute size weight for vertices in network
size <- log(colSums(df_Math_best_toks_smaller))

# create plot
textplot_network(df_Math_best_toks_smaller, vertex_size = size/ max(size) * 3)

Dictionary to Create New Features

Dictionary: Key Math Terms:

Code
my_math_dict <- dictionary(list(modeling=c("real-world","context", "word_problem"),             
              patterns_alg=c("equivalent expression", "distributive", "identify equivalent", "distribute", "factor", "parentheses"),
              apply_division=c("divide","division","quotient", "quotients")))

my_math_dict
Dictionary object with 3 key entries.
- [modeling]:
  - real-world, context, word_problem
- [patterns_alg]:
  - equivalent expression, distributive, identify equivalent, distribute, factor, parentheses
- [apply_division]:
  - divide, division, quotient, quotients
Code
# patterns_alg=c("equivalent expression","radical expression", "rational expression", #"expression", "eqivalent", "distributive", "distribute", "factor")
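Note that `dfm_lookup` matches dictionary patterns against dfm *features*, which here are unigrams, so multi-word entries such as "equivalent expression" and "identify equivalent" cannot match in this form. A hedged alternative is to apply the dictionary at the token stage, where quanteda matches whitespace-separated dictionary values as phrases (the tokens object name below is a hypothetical stand-in; substitute the lemmatized Math tokens from earlier):

```r
# tokens_lookup scans token sequences, so multi-word dictionary
# entries can match; `lem_Math_tokens` is a stand-in name here
math_toks_dict <- tokens_lookup(lem_Math_tokens, dictionary = my_math_dict)
df_math_dict   <- dfm(math_toks_dict)
```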
Code
math_Toks1_mydict <- df_Math_toks1 %>%
  dfm_lookup(my_math_dict)

tail(math_Toks1_mydict, 10)
Document-feature matrix of: 10 documents, 3 features (86.67% sparse) and 12 docvars.
          features
docs       modeling patterns_alg apply_division
  text1001        0            0              0
  text1002        0            0              0
  text1003        1            0              0
  text1004        1            0              0
  text1005        0            0              0
  text1006        0            0              0
[ reached max_ndoc ... 4 more documents ]
Code
df_MathDict <- convert(math_Toks1_mydict, to = "data.frame")



df_MathDict
Code
#Math_DF<-left_join(df_MathDict, Math_DF, by = "doc_id")
Code
# mathToksDFM_mydict <- df_Math_worst_tokens %>%
#   dfm_lookup(my_math_dict)
# 
# head(mathToksDFM_mydict, 10)
# 
# summary(df_Math_toks1)
Code
Math_DF<-mutate(Math_DF, ID = as.character(row_number()))

Math_DF<-mutate(Math_DF, doc_id = str_trim(paste("text", ID, sep="")))

Math_DF<-left_join(df_MathDict, Math_DF, by = "doc_id")
Code
Math_DF<-Math_DF%>%
  mutate(modeling_factor = case_when(
    modeling == 0 ~ "no",
    modeling > 0 ~ "yes"
  ))%>%
  mutate(patterns_alg_factor = case_when(
    patterns_alg == 0 ~ "no",
    patterns_alg > 0 ~ "yes"
  ))%>%
  mutate(apply_division_factor = case_when(
    apply_division == 0 ~ "no",
    apply_division > 0 ~ "yes"
  ))

Modeling Visualization

Code
Math_DF%>%
  #filter(modeling > 0)%>%
  filter(grade_level == 10)%>%
  ggplot( aes(x=`reporting category`, y=`school_state_diff`, fill=`modeling_factor`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Math G10 MCAS: Student Performance Modeling by Reporting Category")
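Since this boxplot is rebuilt for each grade below with only the grade and title changing, a small helper function (a sketch, assuming the same ggplot2, viridis, and hrbrthemes setup as above) would keep the plots consistent:

```r
# hypothetical helper: modeling boxplot for one grade level
plot_modeling <- function(df, grade) {
  df %>%
    filter(grade_level == grade) %>%
    ggplot(aes(x = `reporting category`, y = school_state_diff,
               fill = modeling_factor)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha = 0.6) +
    geom_jitter(color = "black", size = 0.4, alpha = 0.9) +
    theme_ipsum() +
    theme(plot.title = element_text(size = 11)) +
    ggtitle(paste0("Math G", grade,
                   " MCAS: Student Performance Modeling by Reporting Category"))
}

# plot_modeling(Math_DF, 5)
```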

G5 Modeling

Code
Math_DF%>%
  #filter(modeling > 0)%>%
  filter(grade_level == 5)%>%
  ggplot( aes(x=`reporting category`, y=`school_state_diff`, fill=`modeling_factor`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Math G5 MCAS: Student Performance Modeling by Reporting Category")

Code
Math_DF%>%
  filter(grade_level == 5)%>%
  filter(modeling_factor == "yes")%>%
  filter(`reporting category` == "GE")

G6 Modeling

Code
Math_DF%>%
  #filter(modeling > 0)%>%
  filter(grade_level == 6)%>%
  ggplot( aes(x=`reporting category`, y=`school_state_diff`, fill=`modeling_factor`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Math G6 MCAS: Student Performance Modeling by Reporting Category")

Code
#Math_DF
Math_DF%>%
  filter(grade_level == 6)%>%
  filter(modeling_factor == "yes")%>%
  filter(`reporting category` == "RP")%>%
  select(`standard`, `item description`, `school_state_diff`, `year`, item_library_URL)%>%
  arrange(`school_state_diff`)
Code
Math_DF%>%
  filter(grade_level == 6)%>%
  filter(modeling_factor == "yes")%>%
  filter(`reporting category` == "GE")%>%
  select(`standard`, `item description`, `school_state_diff`, `year`, item_library_URL)%>%
  arrange(`school_state_diff`)
Code
# Math_DF%>%
#   filter(grade_level == 6)%>%
#   filter(modeling_factor == "yes")%>%
#   filter(`reporting category` == "SP")%>%
#   select(`standard`, `item description`, `school_state_diff`, `year`, item_library_URL)%>%
#   arrange(`school_state_diff`)
#   

G7 Modeling

Code
Math_DF%>%
  #filter(modeling > 0)%>%
  filter(grade_level == 7)%>%
  ggplot( aes(x=`reporting category`, y=`school_state_diff`, fill=`modeling_factor`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Math G7 MCAS: Student Performance Modeling by Reporting Category")

Code
Math_DF
Code
Math_DF%>%
  filter(grade_level == 7)%>%
  filter(modeling_factor == "yes")%>%
  filter(`reporting category` == "EE")%>%
  select(`standard`, `item description`, `school_state_diff`, `year`, item_library_URL)%>%
  arrange(`school_state_diff`)

G8 Modeling

Code
Math_DF%>%
  #filter(modeling > 0)%>%
  filter(grade_level == 8)%>%
  ggplot( aes(x=`reporting category`, y=`school_state_diff`, fill=`modeling_factor`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Math G8 MCAS: Student Performance Modeling by Reporting Category")

Code
Math_DF%>%
  filter(grade_level == 8)%>%
  filter(modeling_factor == "no")%>%
  filter(`reporting category` == "SP")%>%
  select(`standard`, `item description`, `school_state_diff`, `year`, item_library_URL)%>%
  arrange(`school_state_diff`)

Patterns_Alg Visual

The concepts of equivalent expressions, the distributive property, and factoring seem to be emphasized and particularly challenging for our students in grades 6, 7, and 10. These concepts are not heavily emphasized on the G8 exam.

Code
Math_DF%>%
  #filter(patterns_alg > 0)%>%
  filter(grade_level == 6 | grade_level == 7 | grade_level == 10)%>%
  ggplot( aes(x=`patterns_alg_factor`, y=`school_state_diff`, fill=`patterns_alg_factor`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Math G6, G7, G10 MCAS: Student Performance Patterns Algebra")#+

Code
 #facet_wrap(~`reporting category`)

G6

Code
Math_DF%>%
  #filter(patterns_alg > 0)%>%
  filter(grade_level == 6)%>%
  ggplot( aes(x=`reporting category`, y=`school_state_diff`, fill=`patterns_alg_factor`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Math G6 MCAS: Student Performance Patterns Algebra and Reporting Category")

Code
Math_DF%>%
  filter(grade_level == 6)%>%
  filter(patterns_alg_factor == "yes")%>%
 # filter(`reporting category` == "SP")%>%
  select(`standard`, `item description`, `school_state_diff`, `year`, item_library_URL)%>%
  arrange(`school_state_diff`)

Apply Division Visual

G6

G6 is when students are expected to become fluent in the standard algorithm for long division. Applying division seems to have been a factor in student success in some G6 reporting categories.

Code
Math_DF%>%
  #filter(patterns_alg > 0)%>%
  filter(grade_level == 6)%>%
  ggplot( aes(x=`apply_division_factor`, y=`school_state_diff`, fill=`apply_division_factor`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Math G6 MCAS: Student Performance Apply Division")#+

Code
 #facet_wrap(~`grade_level`)
Code
Math_DF%>%
  filter(grade_level == 6)%>%
  filter(`reporting category` == "NS" | `reporting category` == "EE")%>%
  ggplot( aes(x=`reporting category`, y=`school_state_diff`, fill=`apply_division_factor`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Math G6 MCAS: Student Performance Apply_Division and Reporting Category")

Code
Math_DF%>%
  filter(grade_level == 6)%>%
  filter(apply_division_factor == "yes")%>%
  filter(patterns_alg_factor == "no")%>%
 # filter(`reporting category` == "SP")%>%
  select(`standard`, `item description`, `school_state_diff`, `year`, item_library_URL)%>%
  arrange(`school_state_diff`)

Machine Learning Model to Predict school_state_diff

Prepare Modeling Data Set

Code
#Math_DF
Math_DF_Train_Full<-Math_DF%>%
  mutate(year = as.factor(year))%>%
  mutate(`school_state_bool` = case_when(
    `school_state_diff` <= 0 ~ "negative",
    `school_state_diff` > 0 ~ "positive")
  )%>%
  #filter(grade_level == 5)%>%
  filter(grade_level == 10 | grade_level == 8| grade_level == 7| grade_level == 6|grade_level == 5)%>%
 mutate(id = row_number())%>%
  select(`doc_id`, `id`, `standard`, `reporting category`, `type`, `year`,  `modeling_factor`, `patterns_alg_factor`, `apply_division_factor`, `state%`, `school_state_bool`, `school_state_diff`, `grade_level`)

Math_DF_Train_Full

Create Training Set

Code
# set seed
set.seed(12345)

N<-nrow(Math_DF_Train_Full)

trainIndex <- sample(1:N, .8*N)

testIndex<-c(1:N)[-trainIndex]

# check length of training set
length(trainIndex)
[1] 808
Code
trainIndexDF<-trainIndex

#Math_DF_Test<-Math_DF_Train_Full[-Math_DF_Train]


trainIndexDF<-as.data.frame(t(trainIndex))
dim(trainIndexDF)
[1]   1 808
Code
#dim(Math_DF_Test)

#Create training vector

trainIndexDF
Code
trainIndexDF<-trainIndexDF%>%
  pivot_longer(cols = starts_with("v"), names_to = "value_id", values_to = "id")
#trainIndexDF<-mutate(trainIndexDF, doc_id = str_trim(paste("text", ID, #sep="")))%>%
#  select("doc_id")

trainIndexDF
Code
Math_DF_Train_Full
Code
Math_DF_Train<-left_join(trainIndexDF, Math_DF_Train_Full, by = "id")

Math_DF_Train

Create Test Set

Code
testIndexDF<-testIndex

#Math_DF_Test<-Math_DF_Test_Full[-Math_DF_Test]


testIndexDF<-as.data.frame(t(testIndex))
dim(testIndexDF)
[1]   1 202
Code
#dim(Math_DF_Test)

#Create testing vector

#testIndexDF

testIndexDF<-testIndexDF%>%
  pivot_longer(cols = starts_with("v"), names_to = "value_id", values_to = "id")
#testIndexDF<-mutate(testIndexDF, doc_id = str_trim(paste("text", ID, sep="")))%>%
#  select("doc_id")

#Math_DF_Test_Full

Math_DF_Test<-left_join(testIndexDF, Math_DF_Train_Full, by = "id")
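The transpose-and-pivot construction above works, but because `id` was assigned with `row_number()`, the same train/test frames can be obtained directly by row indexing. A simpler equivalent sketch, assuming `Math_DF_Train_Full` and the index vectors from above:

```r
# direct row-index split; should match the join-based frames above
Math_DF_Train_alt <- Math_DF_Train_Full[trainIndex, ]
Math_DF_Test_alt  <- Math_DF_Train_Full[testIndex, ]

nrow(Math_DF_Train_alt)  # 808
nrow(Math_DF_Test_alt)   # 202
```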

Random Forest Model

Base Regression

Cross Validation

Random forest with text features

Code
data_ctrl<-trainControl(method = "cv", number = 5)

Math_DF_Train_Full_Text<-Math_DF_Train_Full%>%
  select(`standard`, `reporting category`, `type`, `year`, `modeling_factor`, `state%`, `grade_level`, `school_state_diff` )

Math_DF_Train_Full_Base<-Math_DF_Train_Full%>%
  select(`standard`, `reporting category`, `type`, `year`, `state%`, `grade_level`, `school_state_diff` )
Code
model_caret_rf_text<-train(school_state_diff ~.
                      , data = Math_DF_Train_Full_Text,
                      trControl = data_ctrl,
                      method = "rf")
Code
model_caret_rf_text
Random Forest 

1010 samples
   7 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 808, 808, 808, 808, 808 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared    MAE     
    2   8.729335  0.02648748  6.918423
  104   8.779073  0.05711325  6.853889
  206   8.778752  0.05913127  6.860164

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.
Code
varImpPlot(model_caret_rf_text$finalModel)

Code
model_caret_rf<-train(school_state_diff ~.
                      , data = Math_DF_Train_Full_Base,
                      trControl = data_ctrl,
                      method = "rf")
Code
model_caret_rf
Random Forest 

1010 samples
   6 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 807, 808, 809, 808, 808 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared    MAE     
    2   8.745757  0.02236964  6.930960
  103   8.699142  0.06545570  6.794238
  205   8.717517  0.06691320  6.824248

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 103.
Code
Math_DF_Train_X<-Math_DF_Train%>%
  select( `standard`, `reporting category`, `type`, `year`,  `state%`, grade_level)



Math_DF_Train_Y<-Math_DF_Train%>%
  select(`school_state_bool`)

Math_DF_Train_Y
Code
Math_DF_Test_X<-Math_DF_Test%>%
  select( `standard`, `reporting category`, `type`, `year`,  `state%`, grade_level)

Math_DF_Test_Y<-Math_DF_Test%>%
  select(`school_state_diff`)


#Math_DF_Test_Y
#Math_DF_Train_Y

#Math_DF_Train_X

#Math_DF_Test_X
Code
set.seed(444)

diff_RF<-randomForest(Math_DF_Train_X,
                      y = Math_DF_Train$school_state_diff,
                      xtest = Math_DF_Test_X,
                      ytest = Math_DF_Test$school_state_diff,
                      mtry =103,
                      importance = TRUE,
                      ntree = 1010,
                      type="regression")
Warning in randomForest.default(Math_DF_Train_X, y =
Math_DF_Train$school_state_diff, : invalid mtry: reset to within valid range
Code
diff_RF

Call:
 randomForest(x = Math_DF_Train_X, y = Math_DF_Train$school_state_diff,      xtest = Math_DF_Test_X, ytest = Math_DF_Test$school_state_diff,      ntree = 1010, mtry = 103, importance = TRUE, type = "regression") 
               Type of random forest: regression
                     Number of trees: 1010
No. of variables tried at each split: 6

          Mean of squared residuals: 76.37658
                    % Var explained: 7.4
                       Test set MSE: 56.63
                    % Var explained: 3.09

Regression w/ text

Code
Math_DF_Train_X<-Math_DF_Train%>%
  select( `standard`, `reporting category`, `type`, `year`, `state%`, `modeling_factor`, `patterns_alg_factor`, grade_level)



Math_DF_Train_Y<-Math_DF_Train%>%
  select(`school_state_diff`)

Math_DF_Train_Y
Code
Math_DF_Test_X<-Math_DF_Test%>%
  select( `standard`, `reporting category`, `type`, `year`, `state%`, `modeling_factor`, `patterns_alg_factor`, grade_level)

Math_DF_Test_Y<-Math_DF_Test%>%
  select(`school_state_diff`)
Code
set.seed(444)

diff_RF_text<-randomForest(Math_DF_Train_X,
                      y = Math_DF_Train$school_state_diff,
                      xtest = Math_DF_Test_X,
                      ytest = Math_DF_Test$school_state_diff,
                      mtry = 2,
                      importance = TRUE,
                      ntree = 1010,
                      type="regression")
Code
diff_RF_text

Call:
 randomForest(x = Math_DF_Train_X, y = Math_DF_Train$school_state_diff,      xtest = Math_DF_Test_X, ytest = Math_DF_Test$school_state_diff,      ntree = 1010, mtry = 2, importance = TRUE, type = "regression") 
               Type of random forest: regression
                     Number of trees: 1010
No. of variables tried at each split: 2

          Mean of squared residuals: 73.52679
                    % Var explained: 10.86
                       Test set MSE: 53.68
                    % Var explained: 8.15
Code
varImpPlot(diff_RF_text)
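randomForest reports test-set MSE while the caret runs above report RMSE, so taking square roots puts all the models on the same scale (percentage points of school_state_diff). Using the test MSE values printed above:

```r
# test-set MSE values reported by the two randomForest fits above
mse_base <- 56.63  # base model (no dictionary features)
mse_text <- 53.68  # model with modeling / patterns_alg features

round(sqrt(c(base = mse_base, text = mse_text)), 2)
#  base  text
#  7.53  7.33
```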

Support Vector Regression

To do: address the issue that new standards appear in the test set, so the SVM cannot make a prediction for those items.

Code
set.seed(446)

Math_SVM_Train<-Math_DF_Train%>%
  select(`standard`, `reporting category`, `type`, `year`, `state%`, `modeling_factor`, `patterns_alg_factor`, `apply_division_factor`, grade_level, school_state_diff)
#Math_DF_Train_X
#Math_DF_Train$school_state_diff

svm_text<-svm(school_state_diff ~., data = Math_SVM_Train)
                     
svm_text

Call:
svm(formula = school_state_diff ~ ., data = Math_SVM_Train)


Parameters:
   SVM-Type:  eps-regression 
 SVM-Kernel:  radial 
       cost:  1 
      gamma:  0.004854369 
    epsilon:  0.1 


Number of Support Vectors:  744
Code
#svm_pred<-predict(svm_text, Math_DF_Test_X)
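One hedged way to address the to-do above: e1071's svm cannot predict for factor levels it never saw in training, so unseen `standard` values can be filtered out of the test set (and unused levels dropped) before predicting. A sketch, assuming the frames defined above:

```r
# keep only test items whose standard appeared in the training data
seen_standards <- unique(Math_SVM_Train$standard)

Math_SVM_Test_seen <- Math_DF_Test %>%
  filter(standard %in% seen_standards) %>%
  droplevels()

# svm_pred <- predict(svm_text, newdata = Math_SVM_Test_seen)
```

An alternative would be to drop `standard` in favor of coarser features (e.g. `reporting category` alone), so that no test item is unscorable.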

Science, Technology, and Engineering (STE) Exam

Create Corpus

Code
STE_item_corpus <-corpus(STE_DF, text_field = "item description")

#print(STE_item)

#summary(STE_item_corpus)

#STE_DF

preText Pre-Processing Decisions

Before completing the pre-processing, I examined the different choices using preText. I was surprised to see that removing stopwords had a positive regression coefficient, yet the combination “P-W” had the lowest score among the preText results. I can see how key logical words like “not”, “and”, and “or”, which are also stopwords, can have a significant impact on the meaning of an exam question. Perhaps, because each individual text is so short and the texts are designed to assess content skills rather than to tell a story, the stopwords play a more significant role.

Given these results, I will pre-process the data two ways: once using the combination with the lowest preText score, P-W, and once using methods that include lowercasing and lemmatization, which I will use for creating visualizations:

  • Recommended preText score: “P-W” (remove punctuation and stopwords)

  • Alternative approach: “P-N-L-W” + Lemmatization (remove punctuation, remove numbers, lower case, remove stopwords and lemmatization)

Code
# Sys.unsetenv("GITHUB_PAT")
# devtools::install_github("matthewjdenny/preText")
# library(preText)
Code
# 
# preprocessed_documents_STE <- factorial_preprocessing(
#     STE_item_corpus,
#     use_ngrams = TRUE,
#     infrequent_term_threshold = 0.2,
#     verbose = FALSE)
Code
#names(preprocessed_documents_STE)

#head(preprocessed_documents_STE$choices)
Code
# preText_results <- preText(
#     preprocessed_documents_STE,
#     dataset_name = "STE MCAS Item Descriptions",
#     distance_method = "cosine",
#     num_comparisons = 20,
#     verbose = FALSE)
# 
# 
# preText_score_plot(preText_results)
Code
# regression_coefficient_plot(preText_results,
#                             remove_intercept = TRUE)

Tokenization 1: P-W

Code
## Extract the tokens

STE_item_tokens <- tokens(STE_item_corpus)

#print(STE_item_tokens)
Code
STE_item_tokens1 <- tokens(STE_item_corpus, 
    remove_punct = T)

STE_item_tokens1 <- tokens_select(STE_item_tokens1,
                   pattern = stopwords("en"),
                  selection = "remove")

print(STE_item_tokens1)
Tokens consisting of 521 documents and 13 docvars.
text1 :
[1] "Describe"    "material"    "appropriate" "choice"      "part"       
[6] "bridge"     

text2 :
[1] "Describe" "change"   "particle" "model"    "show"     "atoms"    "thermal" 
[8] "energy"  

text3 :
[1] "Relate"    "amount"    "energy"    "seismic"   "wave"      "wave's"   
[7] "amplitude"

text4 :
 [1] "Analyze"      "model"        "bridge"       "determine"    "loads"       
 [6] "system"       "describe"     "advantage"    "disadvantage" "making"      
[11] "design"       "change"      

text5 :
 [1] "Interpret"  "model"      "two"        "oceanic"    "plates"    
 [6] "colliding"  "identify"   "geological" "feature"    "formed"    
[11] "describe"   "process"   
[ ... and 3 more ]

text6 :
 [1] "Complete"     "table"        "showing"      "percentages"  "genotypes"   
 [6] "cross"        "two"          "plants"       "heterozygous" "particular"  
[11] "trait"       

[ reached max_ndoc ... 515 more documents ]

Tokenization 2: P-N-L-W + Lemmatization

Code
# remove punctuation and numbers
STE_item_tokens2 <- tokens(STE_item_corpus, 
    remove_punct = T,
    remove_numbers = T)

STE_item_tokens2 <- tokens_select(STE_item_tokens2,
                   pattern = stopwords("en"),
                  selection = "remove")

# lower case

STE_item_tokens2 <- tokens_tolower(STE_item_tokens2)


#print(STE_item_tokens2)

Lemmatization

When I originally made word clouds, I noticed “object” and “objects” appearing separately, as well as “model” and “models”. I believe these distinctions are important, so I also chose to apply lemmatization.

Code
lem_STE_item_tokens2 <- tokens_replace(STE_item_tokens2,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

lem_STE_item_tokens2
Tokens consisting of 521 documents and 13 docvars.
text1 :
[1] "describe"    "material"    "appropriate" "choice"      "part"       
[6] "bridge"     

text2 :
[1] "describe" "change"   "particle" "model"    "show"     "atom"     "thermal" 
[8] "energy"  

text3 :
[1] "relate"    "amount"    "energy"    "seismic"   "wave"      "wave's"   
[7] "amplitude"

text4 :
 [1] "analyze"      "model"        "bridge"       "determine"    "load"        
 [6] "system"       "describe"     "advantage"    "disadvantage" "make"        
[11] "design"       "change"      

text5 :
 [1] "interpret"  "model"      "two"        "oceanic"    "plate"     
 [6] "collide"    "identify"   "geological" "feature"    "form"      
[11] "describe"   "process"   
[ ... and 3 more ]

text6 :
 [1] "complete"     "table"        "show"         "percentage"   "genotype"    
 [6] "cross"        "two"          "plant"        "heterozygous" "particular"  
[11] "trait"       

[ reached max_ndoc ... 515 more documents ]

Create DFMs

Code
df_STE_toks2<-dfm(lem_STE_item_tokens2)

df_STE_toks1<-dfm(STE_item_tokens1)

Worst performing Items

Code
STE_worst_item_corpus<- corpus_subset(STE_item_corpus, school_state_diff < 2 )

# remove punctuation and numbers
STE_worst_tokens <- tokens(STE_worst_item_corpus, 
    remove_punct = T,
    remove_numbers = T)



STE_worst_tokens <- tokens_tolower(STE_worst_tokens)

STE_worst_tokens<- tokens_select(STE_worst_tokens, 
                     pattern = stopwords("en"),
                     selection = "remove")

lem_STE_worst_tokens <- tokens_replace(STE_worst_tokens,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

df_STE_worst_toks<-dfm(lem_STE_worst_tokens)

df_STE_worst_toks_smaller<-dfm_trim(df_STE_worst_toks, min_docfreq = .05, docfreq_type = "prop")

Middle School (MS) Only

Worst Items

Code
STE_MS_item_corpus<- corpus_subset(STE_item_corpus, grade_level != 9 )

#summary(STE_MS_item_corpus)


STE_MS_worst_item_corpus<-corpus_subset(STE_MS_item_corpus, school_state_diff < 2)

# remove punctuation and numbers
STE_MS_worst_tokens <- tokens(STE_MS_worst_item_corpus, 
    remove_punct = T,
    remove_numbers = T)


STE_MS_worst_tokens <- tokens_tolower(STE_MS_worst_tokens)

STE_MS_worst_tokens<- tokens_select(STE_MS_worst_tokens, 
                     pattern = stopwords("en"),
                     selection = "remove")


lem_STE_MS_worst_tokens <- tokens_replace(STE_MS_worst_tokens,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)



df_STE_MS_worst_toks<-dfm(lem_STE_MS_worst_tokens)

df_STE_MS_worst_toks_smaller <- dfm_trim(df_STE_MS_worst_toks, min_docfreq = .05, docfreq_type = "prop")

Best Items

Code
STE_MS_best_item_corpus<-corpus_subset(STE_MS_item_corpus, school_state_diff > 5)


# remove punctuation and numbers
STE_MS_best_tokens <- tokens(STE_MS_best_item_corpus, 
    remove_punct = T,
    remove_numbers = T)


STE_MS_best_tokens <- tokens_tolower(STE_MS_best_tokens)

STE_MS_best_tokens<- tokens_select(STE_MS_best_tokens, 
                     pattern = stopwords("en"),
                     selection = "remove")


lem_STE_MS_best_tokens <- tokens_replace(STE_MS_best_tokens,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)



df_STE_MS_best_toks<-dfm(lem_STE_MS_best_tokens)

df_STE_MS_best_toks_smaller<-dfm_trim(df_STE_MS_best_toks, min_docfreq = .06, docfreq_type = "prop")

#textplot_wordcloud(df_STE_MS_best_toks)
#textplot_wordcloud(df_STE_MS_best_toks_smaller)

Descriptive Analysis

Code
topfeatures(df_STE_toks2, 20)
determine  describe    object    change   explain       use     model  identify 
      222       133       125        98        97        95        91        85 
   energy   analyze     force       two    system      show interpret different 
       77        77        75        65        60        54        53        48 
     wave     graph      time   compare 
       45        45        44        43 
Code
topfeatures(df_STE_toks1, 20)
determine  describe   explain  identify    object   analyze    energy     model 
      222       130        97        85        84        75        74        73 
      two       use interpret    system different      time     force   objects 
       65        63        51        48        48        43        42        41 
  compare calculate   diagram   changes 
       41        41        40        40 
Code
topfeatures(df_STE_worst_toks_smaller, 20)
determine    object  describe    change     model    energy   analyze     force 
       96        68        54        44        41        39        34        32 
interpret       two calculate   explain  identify     graph      time       use 
       30        30        30        29        28        25        24        22 
  diagram   compare     datum different 
       19        19        18        17 
Code
topfeatures(df_STE_MS_worst_toks_smaller, 20)
determine    object  describe    change     model    energy   analyze     force 
       96        68        54        44        41        39        34        32 
interpret       two calculate   explain  identify     graph      time       use 
       30        30        30        29        28        25        24        22 
  diagram   compare     datum different 
       19        19        18        17 
Code
topfeatures(df_STE_MS_best_toks_smaller, 20)
   determine          use     describe      explain      analyze        model 
          84           58           48           46           34           34 
    identify       system    different     organism       change         show 
          34           32           26           26           24           22 
         two       object        plant   population         part       energy 
          22           22           20           20           18           18 
relationship          can 
          18           18 

Visualizations

It was evident that the worst-item tokens largely represent the text of items in the physical sciences. Our 9th grade students take a science exam that covers only Physics, and as you can see below, they have performed weaker against their peers in the state than Rising Tide students at other grade levels have. It would therefore be worth exploring the worst items for Middle School (5th-8th grade) students separately. When I look at Middle School student performance by Reporting Category, one of their weakest categories appears to be Physical Sciences, which also contributes to the dominance of physical science terms among the worst-item tokens.

Code
#STE_DF

STE_DF%>%
  filter(grade_level != 9)%>%
  group_by(`reporting category`)%>%
  summarize(mean_diff = mean(school_state_diff))
Code
STE_DF%>%
  #filter(grade_level != 9)%>%
  #filter(Course == "STEM"| Course == "Math" & `Reporting Category` == "Geometry")%>%
  #filter(`Cluster` != "Convert like measurement units within a given measurement system.")%>%
  #filter(`Cluster` != "Geometric measurement: Understand concepts of volume and relate volume to multiplication and to addition.")%>%
  #filter(`year` == 2024 | year == 2019)%>%
  ggplot( aes(x=`grade_level`, y=`school_state_diff`, fill=`grade_level`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("STE MCAS: Student Performance by Grade Level") +
    xlab("")#+

Code
STE_DF%>%
  filter(grade_level != 9)%>%
  #filter(`year` == 2024 | year == 2019)%>%
  ggplot( aes(x=`reporting category`, y=`school_state_diff`, fill=`reporting category`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("MS STE MCAS: Student Performance by Reporting Category") +
    xlab("")#+

Code
STE_DF%>%
  filter(grade_level != 9)%>%
  #filter(Course == "STEM"| Course == "Math" & `Reporting Category` == "Geometry")%>%
  filter(`reporting category` == "PS")%>%
  #filter(`Cluster` != "Geometric measurement: Understand concepts of volume and relate volume to multiplication and to addition.")%>%
  #filter(`year` == 2024 | year == 2019)%>%
  ggplot( aes(x=`discipline core idea`, y=`school_state_diff`, fill=`discipline core idea`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
     # legend.position="none",
      axis.text.x=element_blank(),
      plot.title = element_text(size=11)
    ) +
    ggtitle("MS STE MCAS: Student Performance in Physical Sciences") +
    xlab("")#+

Word Cloud

In the word cloud, physical science terms appear larger than terms from the other disciplines. This also suggests a need to explore the Middle School exams separately, since the 5th-8th grade exams give equal weight to Physical Sciences, Life Science, Earth Science, and Technology and Engineering.

Code
#dfm_item


textplot_wordcloud(df_STE_toks2)

Code
smaller_dfm <-dfm_trim(df_STE_toks2, min_termfreq = 10)
smaller_dfm <- dfm_trim(smaller_dfm, min_docfreq = 0.07, docfreq_type = "prop")

textplot_wordcloud(smaller_dfm, min_count =3, random_order = FALSE)

Examine text visuals on worst-performing questions

Code
STE_worst_item_corpus<- corpus_subset(STE_item_corpus, school_state_diff < 0 )

# remove punctuation and numbers
STE_worst_tokens <- tokens(STE_worst_item_corpus, 
    remove_punct = T,
    remove_numbers = T)



STE_worst_tokens <- tokens_tolower(STE_worst_tokens)

STE_worst_tokens<- tokens_select(STE_worst_tokens, 
                     pattern = stopwords("en"),
                     selection = "remove")

lem_STE_worst_tokens<-tokens_replace(STE_worst_tokens,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

df_STE_worst_toks<-dfm(lem_STE_worst_tokens)

df_STE_worst_toks_smaller<-dfm_trim(df_STE_worst_toks, min_docfreq = .06, docfreq_type = "prop")





textplot_wordcloud(df_STE_worst_toks)

Code
textplot_wordcloud(df_STE_worst_toks_smaller)

Examine text visual on best-performing questions

Code
STE_best_item_corpus<- corpus_subset(STE_item_corpus, school_state_diff > 2 )

# remove punctuation and numbers
STE_best_tokens <- tokens(STE_best_item_corpus, 
    remove_punct = T,
    remove_numbers = T)



STE_best_tokens <- tokens_tolower(STE_best_tokens)

STE_best_tokens<- tokens_select(STE_best_tokens, 
                     pattern = stopwords("en"),
                     selection = "remove")


lem_STE_best_tokens<-tokens_replace(STE_best_tokens,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

df_STE_best_toks<-dfm(lem_STE_best_tokens)

df_STE_best_toks_smaller<-dfm_trim(df_STE_best_toks, min_docfreq = .07, docfreq_type = "prop")





textplot_wordcloud(df_STE_best_toks)

Code
textplot_wordcloud(df_STE_best_toks_smaller)

Word Connections

Worst items

Code
dim(df_STE_worst_toks_smaller)
[1] 148  31
Code
# compute size weight for vertices in network
size <- log(colSums(df_STE_worst_toks_smaller))

# create plot
textplot_network(df_STE_worst_toks_smaller, vertex_size = size/ max(size) * 3)

Best Items

Code
# check the dimensions (i.e., the number of rows and the number of columns) of the matrix we created

dim(df_STE_best_toks_smaller)
[1] 301  20
Code
# compute size weight for vertices in network
size <- log(colSums(df_STE_best_toks_smaller))

# create plot
textplot_network(df_STE_best_toks_smaller, vertex_size = size/ max(size) * 3)

Word connections for all items

Code
# compute the feature co-occurrence matrix from the trimmed dfm and check its dimensions
smaller_fcm <- fcm(smaller_dfm)
dim(smaller_fcm)
Code
# compute size weight for vertices in network
size <- log(colSums(smaller_fcm))

# create plot
textplot_network(smaller_fcm, vertex_size = size/ max(size) * 3)

MS Worst

Code
# check the dimensions (i.e., the number of rows and the number of columns) of the matrix we created
dim(df_STE_MS_worst_toks_smaller)
Code
# compute size weight for vertices in network
size <- log(colSums(df_STE_MS_worst_toks_smaller))

# create plot
textplot_network(df_STE_MS_worst_toks_smaller, vertex_size = size/ max(size) * 3)

MS Best

Code
#df_STE_MS_best_toks_smaller

dim(df_STE_MS_best_toks_smaller)
[1] 206  26
Code
# compute size weight for vertices in network
size <- log(colSums(df_STE_MS_best_toks_smaller))

# create plot
textplot_network(df_STE_MS_best_toks_smaller, vertex_size = size/ max(size) * 3)

Dictionary to Create New Features

Dictionary: Key STE Terms:

Code
# note: a multi-word pattern such as "interpret graph" cannot match in a unigram dfm;
# counting it would require tokens_lookup() on the tokens or n-gram features
my_ste_dict <- dictionary(list(model=c("graph","interpret graph", "model", "diagram"),             
              algebra=c("calculate")))

my_ste_dict
Dictionary object with 2 key entries.
- [model]:
  - graph, interpret graph, model, diagram
- [algebra]:
  - calculate
Code
# patterns_alg=c("equivalent expression","radical expression", "rational expression", #"expression", "eqivalent", "distributive", "distribute", "factor")
Code
STE_Toks1_mydict <- df_STE_toks1 %>%
  dfm_lookup(my_ste_dict)

tail(STE_Toks1_mydict, 10)
Document-feature matrix of: 10 documents, 2 features (75.00% sparse) and 13 docvars.
         features
docs      model algebra
  text512     0       0
  text513     0       0
  text514     0       0
  text515     0       0
  text516     1       1
  text517     0       0
[ reached max_ndoc ... 4 more documents ]
Code
df_STEDict <- convert(STE_Toks1_mydict, to = "data.frame")



df_STEDict
Code
#Math_DF<-left_join(df_MathDict, Math_DF, by = "doc_id")
Code
# mathToksDFM_mydict <- df_Math_worst_tokens %>%
#   dfm_lookup(my_math_dict)
# 
# head(mathToksDFM_mydict, 10)
# 
# summary(df_Math_toks1)
Code
STE_DF<-mutate(STE_DF, ID = as.character(row_number()))

STE_DF<-mutate(STE_DF, doc_id = str_trim(paste("text", ID, sep="")))

STE_DF<-left_join(df_STEDict, STE_DF, by = "doc_id")

STE_DF
Code
STE_DF<-STE_DF%>%
  mutate(algebra_factor = case_when(
    algebra == 0 ~ "no",
    algebra > 0 ~ "yes"
  ))%>%
  mutate(model_factor = case_when(
    model == 0 ~ "no",
    model > 0 ~ "yes"
  ))

Machine Learning base model

Code
STE_DF
Code
STE_DF_Train_Full<-STE_DF%>%
   mutate(year = as.factor(year))%>%
#   mutate(`school_state_bool` = case_when(
#     `school_state_diff` <= 0 ~ "negative",
#     `school_state_diff` > 0 ~ "positive")
#   )%>%
#  filter(grade_level == 10)%>%
#   filter(grade_level == 10 | grade_level == 8| grade_level == 7| grade_level == 6|grade_level == 5)%>%
  mutate(id = row_number())%>%
   select(`id`, `standard`, `reporting category`, `discipline core idea`, `pts`, `year`, `type`, `state%`, `school_state_diff`, `grade_level`)
# 
 STE_DF_Train_Full

Create Training Set

Code
# set seed
set.seed(12345)

N<-nrow(STE_DF_Train_Full)

trainIndex <- sample(1:N, floor(.8 * N))

testIndex<-c(1:N)[-trainIndex]

# check length of training set
length(trainIndex)
[1] 416
Code
trainIndexDF<-trainIndex

#Math_DF_Test<-Math_DF_Train_Full[-Math_DF_Train]


trainIndexDF<-as.data.frame(t(trainIndex))
dim(trainIndexDF)
[1]   1 416
Code
#dim(Math_DF_Test)

#Create training vector

trainIndexDF
Code
trainIndexDF<-trainIndexDF%>%
  pivot_longer(cols = starts_with("v"), names_to = "value_id", values_to = "id")
#trainIndexDF<-mutate(trainIndexDF, doc_id = str_trim(paste("text", ID, #sep="")))%>%
#  select("doc_id")

trainIndexDF
Code
STE_DF_Train_Full
Code
STE_DF_Train<-left_join(trainIndexDF, STE_DF_Train_Full, by = "id")

STE_DF_Train
Code
STE_DF_Train_X<-STE_DF_Train%>%
  select(`standard`, `reporting category`, `discipline core idea`, `pts`, `year`, `type`, `state%`, `grade_level`)

STE_DF_Train_Y<-STE_DF_Train%>%
  select(`school_state_diff`)

STE_DF_Train_Y

Create Test Set

Code
testIndexDF<-testIndex

#STE_DF_Test<-STE_DF_Test_Full[-STE_DF_Test]


testIndexDF<-as.data.frame(t(testIndex))
dim(testIndexDF)
[1]   1 105
Code
#dim(STE_DF_Test)

#Create testing vector

#testIndexDF

testIndexDF<-testIndexDF%>%
  pivot_longer(cols = starts_with("v"), names_to = "value_id", values_to = "id")
#testIndexDF<-mutate(testIndexDF, doc_id = str_trim(paste("text", ID, sep="")))%>%
#  select("doc_id")

#STE_DF_Test_Full

STE_DF_Test<-left_join(testIndexDF, STE_DF_Train_Full, by = "id")

STE_DF_Test_X<-STE_DF_Test%>%
select(`standard`, `reporting category`, `discipline core idea`, `pts`, `year`, `type`, `state%`, `grade_level`)


STE_DF_Test_Y<-STE_DF_Test%>%
  select(`school_state_diff`)


STE_DF_Test_Y
Code
STE_DF_Train_Y
Code
STE_DF_Train_X
Code
STE_DF_Test_X

Random Forest Model

Regression

Code
set.seed(444)

STE_diff_RF<-randomForest(STE_DF_Train_X,
                      y = STE_DF_Train$school_state_diff,
                      xtest = STE_DF_Test_X,
                      ytest = STE_DF_Test$school_state_diff,
                      mtry = 3,
                      importance = TRUE,
                      type="regression",
                      ntree = 10000)


STE_diff_RF

Call:
 randomForest(x = STE_DF_Train_X, y = STE_DF_Train$school_state_diff,      xtest = STE_DF_Test_X, ytest = STE_DF_Test$school_state_diff,      ntree = 10000, mtry = 3, importance = TRUE, type = "regression") 
               Type of random forest: regression
                     Number of trees: 10000
No. of variables tried at each split: 3

          Mean of squared residuals: 30.77792
                    % Var explained: 54.62
                       Test set MSE: 44.47
                    % Var explained: 28.7
Code
STE_diff_RF2<-randomForest(STE_DF_Train_X,
                      y = STE_DF_Train$school_state_diff,
                     # xtest = STE_DF_Test_X,
                      mtry = 3,
                      importance = TRUE,
                      type="regression",
                      ntree = 10000)

 predicted<-predict(STE_diff_RF2, STE_DF_Test_X)
 mae(predicted, STE_DF_Test$school_state_diff)
[1] 4.93439
Code
# 
 actual <- STE_DF_Test$school_state_diff
# 
# 
# 
# 
 R2 <- 1 - (sum((actual-predicted)^2)/sum((actual-mean(actual))^2))
# 
 R2
[1] 0.2855837
Code
# 
 mean(abs(actual-predicted))
[1] 4.93439
Code
# 
mean(abs(actual))
[1] 6.609524
Code
sd(abs(actual))
[1] 5.467553
Code
sd(abs(actual-predicted))
[1] 4.481737
Code
varImpPlot(STE_diff_RF)

Model w/ text features

Code
#STE_DF
STE_DF_Train_Full<-STE_DF%>%
   mutate(year = as.factor(year))%>%
#   mutate(`school_state_bool` = case_when(
#     `school_state_diff` <= 0 ~ "negative",
#     `school_state_diff` > 0 ~ "positive")
#   )%>%
#  filter(grade_level == 10)%>%
#   filter(grade_level == 10 | grade_level == 8| grade_level == 7| grade_level == 6|grade_level == 5)%>%
  mutate(id = row_number())%>%
   select(`id`, `standard`, `reporting category`, `discipline core idea`, `pts`, `year`, `type`, `state%`, `school_state_diff`, `grade_level`, model_factor, algebra_factor)
# 
 STE_DF_Train_Full

Create Training Set

Code
# set seed
set.seed(12345)

N<-nrow(STE_DF_Train_Full)

trainIndex <- sample(1:N, floor(.8 * N))

testIndex<-c(1:N)[-trainIndex]

# check length of training set
length(trainIndex)
[1] 416
Code
trainIndexDF<-trainIndex

#Math_DF_Test<-Math_DF_Train_Full[-Math_DF_Train]


trainIndexDF<-as.data.frame(t(trainIndex))
dim(trainIndexDF)
[1]   1 416
Code
#dim(Math_DF_Test)

#Create training vector

trainIndexDF
Code
trainIndexDF<-trainIndexDF%>%
  pivot_longer(cols = starts_with("v"), names_to = "value_id", values_to = "id")
#trainIndexDF<-mutate(trainIndexDF, doc_id = str_trim(paste("text", ID, #sep="")))%>%
#  select("doc_id")

trainIndexDF
Code
STE_DF_Train_Full
Code
STE_DF_Train<-left_join(trainIndexDF, STE_DF_Train_Full, by = "id")

STE_DF_Train
Code
STE_DF_Train_X<-STE_DF_Train%>%
  select(`standard`, `reporting category`, `discipline core idea`, `pts`, `year`, `state%`, `grade_level`, model_factor,  type)

STE_DF_Train_Y<-STE_DF_Train%>%
  select(`school_state_diff`)

STE_DF_Train_Y

Create Test Set

Code
testIndexDF<-testIndex

#STE_DF_Test<-STE_DF_Test_Full[-STE_DF_Test]


testIndexDF<-as.data.frame(t(testIndex))
dim(testIndexDF)
[1]   1 105
Code
#dim(STE_DF_Test)

#Create testing vector

#testIndexDF

testIndexDF<-testIndexDF%>%
  pivot_longer(cols = starts_with("v"), names_to = "value_id", values_to = "id")
#testIndexDF<-mutate(testIndexDF, doc_id = str_trim(paste("text", ID, sep="")))%>%
#  select("doc_id")

#STE_DF_Test_Full

STE_DF_Test<-left_join(testIndexDF, STE_DF_Train_Full, by = "id")

STE_DF_Test_X<-STE_DF_Test%>%
select(`standard`, `reporting category`, `discipline core idea`, `pts`, `year`,  `state%`, `grade_level`, model_factor, type)


STE_DF_Test_Y<-STE_DF_Test%>%
  select(`school_state_diff`)


STE_DF_Test_Y
Code
STE_DF_Train_Y
Code
STE_DF_Train_X
Code
STE_DF_Test_X

Random Forest Model w/ text features

Regression

Code
set.seed(444)

STE_diff_RF_Text<-randomForest(STE_DF_Train_X,
                      y = STE_DF_Train$school_state_diff,
                      xtest = STE_DF_Test_X,
                      ytest = STE_DF_Test$school_state_diff,
                      mtry = 3,
                      importance = TRUE,
                      type="regression",
                      ntree = 10000)


STE_diff_RF_Text

Call:
 randomForest(x = STE_DF_Train_X, y = STE_DF_Train$school_state_diff,      xtest = STE_DF_Test_X, ytest = STE_DF_Test$school_state_diff,      ntree = 10000, mtry = 3, importance = TRUE, type = "regression") 
               Type of random forest: regression
                     Number of trees: 10000
No. of variables tried at each split: 3

          Mean of squared residuals: 30.61591
                    % Var explained: 54.86
                       Test set MSE: 42.68
                    % Var explained: 31.57
Code
varImpPlot(STE_diff_RF_Text)

English Language Arts Exam

Create Corpus

Code
ELA_item_corpus <-corpus(ELA_DF, text_field = "item description")

#print(ELA_item)

summary(ELA_item_corpus)
Code
#ELA_DF

preText Pre-Processing Decisions

Before committing to pre-processing decisions, I examined the different choices using preText. I was surprised to see that removing stop words had a positive coefficient, yet the combination "P-W" had the lowest score of the preText results. I can see how key logical words like "not", "and", and "or", which are also stop words, can have a significant impact on the meaning of an exam question. Perhaps because each individual text is so short, and the texts are designed to assess content skills rather than to narrate, the stop words play a more significant role.
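The concern about logical stop words can be illustrated with a minimal base-R sketch; the token vector and the shortened stop-word list below are hypothetical (with quanteda, the equivalent would be passing `setdiff(stopwords("en"), keep)` as the pattern to `tokens_select()`):

```r
# Hypothetical sketch: remove stop words while retaining logical/negation
# words whose removal could flip the meaning of a test item.
keep  <- c("not", "and", "or", "but")
stops <- c("the", "a", "an", "is", "by", "which",
           "not", "and", "or", "but")          # tiny illustrative stop list
custom_stops <- setdiff(stops, keep)

toks <- c("which", "statement", "is", "not", "supported", "by", "the", "passage")
filtered <- toks[!toks %in% custom_stops]
filtered  # "not" survives; purely grammatical stop words are dropped
```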

Given these results, I will pre-process the data in two ways: once using the pipeline with the lowest recommended preText score, and once using methods that, based on my background knowledge of the topic and the individual regression coefficients, I suspect will reduce the number of tokens for analysis without changing the meaning of the text:

  • Recommended preText score: “S-W-3” (remove stop words, Lemmatization/Stemming)

  • Alternative approach: “P-N-W-L” and Lemmatization (remove punctuation, remove numbers, lower case, and lemmatization)

Due to teacher input, I am also creating subcorpora of items that assessed Literature and items that assessed Informational text. Teachers describe having to change their instructional approach for these two text types and observing different student weaknesses with each.
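The split can be sketched in base R with a hypothetical two-item data frame; with quanteda, the same idea is `corpus_subset(ELA_item_corpus, text_type == "Literature")` on the corpus docvar:

```r
# Hypothetical sketch: splitting item descriptions by text type
# before building separate corpora for Literature and Informational items.
items <- data.frame(
  description = c("Analyze the theme of a poem.",
                  "Identify the author's claim in an article."),
  text_type   = c("Literature", "Informational")
)

lit_items <- items[items$text_type == "Literature", ]
inf_items <- items[items$text_type == "Informational", ]

nrow(lit_items)  # 1
```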

Code
# Sys.unsetenv("GITHUB_PAT")
# devtools::install_github("matthewjdenny/preText")
# library(preText)
Code
# preprocessed_documents_ELA <- factorial_preprocessing(
#     ELA_item_corpus,
#     use_ngrams = TRUE,
#     infrequent_term_threshold = 0.2,
#     verbose = FALSE)
Code
# names(preprocessed_documents_ELA)
# 
# head(preprocessed_documents_ELA$choices)
Code
# preText_results <- preText(
#     preprocessed_documents_ELA,
#     dataset_name = "ELA MCAS Item Descriptions",
#     distance_method = "cosine",
#     num_comparisons = 20,
#     verbose = FALSE)
# 
# 
# preText_score_plot(preText_results)
Code
# regression_coefficient_plot(preText_results,
#                             remove_intercept = TRUE)

Tokenization 1: L-W-Lemma

Once we learn more about n-grams, I would also like to incorporate them into my preprocessing.
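As a placeholder, here is a minimal base-R sketch of bigram construction, mirroring what `quanteda::tokens_ngrams(x, n = 2)` would produce for one document (adjacent tokens joined with an underscore); the token vector is hypothetical:

```r
# Minimal base-R sketch of bigram construction for a single document,
# analogous to quanteda::tokens_ngrams(x, n = 2).
make_bigrams <- function(words) {
  if (length(words) < 2) return(character(0))
  paste(words[-length(words)], words[-1], sep = "_")
}

toks <- c("determine", "effect", "figurative", "language", "passage")
make_bigrams(toks)
# "determine_effect" "effect_figurative" "figurative_language" "language_passage"
```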

Code
## Extract the tokens

ELA_item_tokens <- tokens(ELA_item_corpus)

print(ELA_item_tokens)
Tokens consisting of 693 documents and 13 docvars.
text1 :
 [1] "Determine"  "the"        "effect"     "of"         "the"       
 [6] "repetition" "of"         "a"          "word"       "using"     
[11] "evidence"   "from"      
[ ... and 3 more ]

text2 :
 [1] "Determine" "what"      "event"     "will"      "happen"    "next"     
 [7] "using"     "evidence"  "from"      "the"       "passage"   "."        

text3 :
 [1] "Determine"  "the"        "effect"     "of"         "figurative"
 [6] "language"   "in"         "a"          "passage"    "."         

text4 :
 [1] "Identify"   "the"        "purpose"    "of"         "an"        
 [6] "author's"   "use"        "of"         "literary"   "techniques"
[11] "in"         "a"         
[ ... and 2 more ]

text5 :
 [1] "Make"        "an"          "inference"   "about"       "how"        
 [6] "mood"        "is"          "established" "in"          "a"          
[11] "passage"     "."          

text6 :
 [1] "Determine" "the"       "effect"    "of"        "the"       "structure"
 [7] "of"        "a"         "section"   "in"        "a"         "passage"  
[ ... and 1 more ]

[ reached max_ndoc ... 687 more documents ]
Code
ELA_item_tokens1 <- tokens_tolower(ELA_item_tokens)

ELA_item_tokens1 <- tokens_select(ELA_item_tokens1,
                   pattern = stopwords("en"),
                  selection = "remove")

print(ELA_item_tokens1)
Tokens consisting of 693 documents and 13 docvars.
text1 :
[1] "determine"  "effect"     "repetition" "word"       "using"     
[6] "evidence"   "passage"    "."         

text2 :
[1] "determine" "event"     "happen"    "next"      "using"     "evidence" 
[7] "passage"   "."        

text3 :
[1] "determine"  "effect"     "figurative" "language"   "passage"   
[6] "."         

text4 :
[1] "identify"   "purpose"    "author's"   "use"        "literary"  
[6] "techniques" "passage"    "."         

text5 :
[1] "make"        "inference"   "mood"        "established" "passage"    
[6] "."          

text6 :
[1] "determine" "effect"    "structure" "section"   "passage"   "."        

[ reached max_ndoc ... 687 more documents ]

Tokenization 2: P-N-L-W & Lemmatization

Code
# remove punctuation and numbers
ELA_item_tokens2 <- tokens(ELA_item_corpus, 
    remove_punct = T,
    remove_numbers = T)

# remove stopwords

ELA_item_tokens2 <- tokens_select(ELA_item_tokens2,
                   pattern = stopwords("en"),
                  selection = "remove")

# lower case

ELA_item_tokens2 <- tokens_tolower(ELA_item_tokens2)




print(ELA_item_tokens2)
Tokens consisting of 693 documents and 13 docvars.
text1 :
[1] "determine"  "effect"     "repetition" "word"       "using"     
[6] "evidence"   "passage"   

text2 :
[1] "determine" "event"     "happen"    "next"      "using"     "evidence" 
[7] "passage"  

text3 :
[1] "determine"  "effect"     "figurative" "language"   "passage"   

text4 :
[1] "identify"   "purpose"    "author's"   "use"        "literary"  
[6] "techniques" "passage"   

text5 :
[1] "make"        "inference"   "mood"        "established" "passage"    

text6 :
[1] "determine" "effect"    "structure" "section"   "passage"  

[ reached max_ndoc ... 687 more documents ]

Lemmatization

Code
lem_ELA_item_tokens2<-tokens_replace(ELA_item_tokens2,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

lem_ELA_item_tokens2
Tokens consisting of 693 documents and 13 docvars.
text1 :
[1] "determine"  "effect"     "repetition" "word"       "use"       
[6] "evidence"   "passage"   

text2 :
[1] "determine" "event"     "happen"    "next"      "use"       "evidence" 
[7] "passage"  

text3 :
[1] "determine"  "effect"     "figurative" "language"   "passage"   

text4 :
[1] "identify"  "purpose"   "author's"  "use"       "literary"  "technique"
[7] "passage"  

text5 :
[1] "make"      "inference" "mood"      "establish" "passage"  

text6 :
[1] "determine" "effect"    "structure" "section"   "passage"  

[ reached max_ndoc ... 687 more documents ]
Code
lem_ELA_item_tokens1<-tokens_replace(ELA_item_tokens1,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

lem_ELA_item_tokens1
Tokens consisting of 693 documents and 13 docvars.
text1 :
[1] "determine"  "effect"     "repetition" "word"       "use"       
[6] "evidence"   "passage"    "."         

text2 :
[1] "determine" "event"     "happen"    "next"      "use"       "evidence" 
[7] "passage"   "."        

text3 :
[1] "determine"  "effect"     "figurative" "language"   "passage"   
[6] "."         

text4 :
[1] "identify"  "purpose"   "author's"  "use"       "literary"  "technique"
[7] "passage"   "."        

text5 :
[1] "make"      "inference" "mood"      "establish" "passage"   "."        

text6 :
[1] "determine" "effect"    "structure" "section"   "passage"   "."        

[ reached max_ndoc ... 687 more documents ]

Create DFMs

Code
df_ELA_toks2<-dfm(lem_ELA_item_tokens2)

df_ELA_toks1<-dfm(lem_ELA_item_tokens1)

Worst performing Items

Literature

Code
ELA_worst_lit_item_corpus<- corpus_subset(ELA_item_corpus, school_state_diff < 2 & text_type == "Literature" )

# remove punctuation and numbers
ELA_worst_lit_tokens <- tokens(ELA_worst_lit_item_corpus, 
    remove_punct = T,
    remove_numbers = T)

ELA_worst_lit_tokens <- tokens_tolower(ELA_worst_lit_tokens)


ELA_worst_lit_tokens <-  tokens_select(ELA_worst_lit_tokens,
                   pattern = stopwords("en"),
                  selection = "remove")

lem_ELA_worst_lit_tokens<-tokens_replace(ELA_worst_lit_tokens,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

df_ELA_worst_lit_tokens<-dfm(lem_ELA_worst_lit_tokens)
df_ELA_worst_lit_toks_smaller<- dfm_trim(df_ELA_worst_lit_tokens, min_docfreq = 0.09, docfreq_type = "prop")

Informational

Code
ELA_worst_inf_item_corpus<- corpus_subset(ELA_item_corpus, school_state_diff < 2 & text_type == "Informational" )

# remove punctuation and numbers
ELA_worst_inf_tokens <- tokens(ELA_worst_inf_item_corpus, 
    remove_punct = T,
    remove_numbers = T)

ELA_worst_inf_tokens <- tokens_tolower(ELA_worst_inf_tokens)


ELA_worst_inf_tokens <-  tokens_select(ELA_worst_inf_tokens,
                   pattern = stopwords("en"),
                  selection = "remove")

lem_ELA_worst_inf_tokens<-tokens_replace(ELA_worst_inf_tokens,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

df_ELA_worst_inf_tokens<-dfm(lem_ELA_worst_inf_tokens)
df_ELA_worst_inf_toks_smaller<- dfm_trim(df_ELA_worst_inf_tokens, min_docfreq = 0.09, docfreq_type = "prop")

Best performing Items

Literature

Code
ELA_best_lit_item_corpus<- corpus_subset(ELA_item_corpus, school_state_diff > 5 & text_type == "Literature" )

# remove punctuation and numbers
ELA_best_lit_tokens <- tokens(ELA_best_lit_item_corpus, 
    remove_punct = T,
    remove_numbers = T)

ELA_best_lit_tokens <- tokens_tolower(ELA_best_lit_tokens)


ELA_best_lit_tokens <-  tokens_select(ELA_best_lit_tokens,
                   pattern = stopwords("en"),
                  selection = "remove")

lem_ELA_best_lit_tokens<-tokens_replace(ELA_best_lit_tokens,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

df_ELA_best_lit_tokens<-dfm(lem_ELA_best_lit_tokens)
df_ELA_best_lit_toks_smaller<- dfm_trim(df_ELA_best_lit_tokens, min_docfreq = 0.09, docfreq_type = "prop")

Informational

Code
ELA_best_inf_item_corpus<- corpus_subset(ELA_item_corpus, school_state_diff > 5 & text_type == "Informational" )

# remove punctuation and numbers
ELA_best_inf_tokens <- tokens(ELA_best_inf_item_corpus, 
    remove_punct = T,
    remove_numbers = T)

ELA_best_inf_tokens <- tokens_tolower(ELA_best_inf_tokens)


ELA_best_inf_tokens <-  tokens_select(ELA_best_inf_tokens,
                   pattern = stopwords("en"),
                  selection = "remove")

lem_ELA_best_inf_tokens<-tokens_replace(ELA_best_inf_tokens,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

df_ELA_best_inf_tokens<-dfm(lem_ELA_best_inf_tokens)
df_ELA_best_inf_toks_smaller<- dfm_trim(df_ELA_best_inf_tokens, min_docfreq = 0.09, docfreq_type = "prop")

Descriptive Analysis

Code
topfeatures(df_ELA_toks2, 20)
  passage determine  identify   analyze character      idea   article   support 
      388       292       171       142       133       111       105        96 
     mean       two paragraph       use  specific      poem      word    detail 
       92        88        87        85        85        83        80        78 
  excerpt  evidence inference      make 
       77        74        72        68 
Code
topfeatures(df_ELA_toks1, 20)
        .   passage determine  identify   analyze character      idea   article 
      694       388       292       171       142       133       111       105 
  support      mean       two paragraph       use  specific      poem      word 
       96        92        88        87        85        85        83        80 
   detail   excerpt  evidence inference 
       78        77        74        72 
Code
topfeatures(df_ELA_worst_lit_toks_smaller, 20)
  passage determine      poem character   analyze  identify   excerpt    detail 
       45        38        23        22        21        19        18        16 
      two  specific       use paragraph      make inference      base      idea 
       14        13        12        12        12        12        11        10 
    theme   develop 
        9         9 
Code
topfeatures(df_ELA_best_lit_toks_smaller, 20)
  passage determine character   analyze  identify    detail inference      make 
      134        83        72        45        41        32        27        26 
      use      poem  specific      mean      base paragraph      word   support 
       25        25        25        24        23        21        20        18 
      two 
       18 
Code
topfeatures(df_ELA_best_inf_toks_smaller, 20)
  passage determine  identify   article      idea   support       two      mean 
       72        67        41        41        36        27        25        24 
  purpose   excerpt      word paragraph   similar   analyze     topic  evidence 
       22        22        21        21        21        21        20        19 
  context       use 
       16        14 
Code
topfeatures(df_ELA_worst_inf_toks_smaller, 20)
  article determine  identify   passage      idea   support  evidence paragraph 
       32        29        25        25        19        14        12        11 
      use   purpose  sentence   section      mean  specific   context   analyze 
       10        10         9         9         9         9         8         8 
      two     claim      word   similar 
        7         7         7         7 

Visualizations

Code
#ELA_DF

ELA_DF%>%
  group_by(`reporting category`)%>%
  summarize(mean_diff = mean(school_state_diff))
Code
ELA_DF%>%
  ggplot( aes(x=`reporting category`, y=`school_state_diff`, fill=`text_type`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("ELA MCAS: Student Performance by Reporting Category") 

Code
ELA_DF%>%
  ggplot( aes(x=`reporting category`, y=`school_state_diff`, fill=`Cluster`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("ELA MCAS: Student Performance by Reporting Category")

Code
ELA_DF%>%
  filter(text_type == "Literature")%>%
  ggplot( aes(x=`reporting category`, y=`school_state_diff`, fill=`Cluster`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("ELA MCAS: Student Performance by Reporting Category") +
    labs(caption = "Literature") 

Code
ELA_DF%>%
  filter(`reporting category` == "LA")%>%
  ggplot( aes(x=`Cluster`, y=`school_state_diff`, fill=`text_type`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11),
      axis.text.x = element_text(angle = 45, hjust = 1)
    ) +
    ggtitle("ELA MCAS: Student Performance by Language Cluster and Text Type") 

Word Cloud

Code
#dfm_item
textplot_wordcloud(df_ELA_toks2)

Code
smaller_dfm <- dfm_trim(df_ELA_toks2, min_termfreq = 5)
smaller_dfm <- dfm_trim(smaller_dfm, min_docfreq = 0.05, docfreq_type = "prop")

textplot_wordcloud(smaller_dfm, min_count = 3, random_order = FALSE)

Word Connections

Literature

Best Items
Code
### Word connections for best items

dim(df_ELA_best_lit_toks_smaller)
[1] 196  17
Code
# compute size weight for vertices in network
size <- log(colSums(df_ELA_best_lit_toks_smaller))

# create plot
textplot_network(df_ELA_best_lit_toks_smaller, vertex_size = size/ max(size) * 3)

Worst Items
Code
### Word connections for worst items

dim(df_ELA_worst_lit_toks_smaller)
[1] 93 18
Code
# compute size weight for vertices in network
size <- log(colSums(df_ELA_worst_lit_toks_smaller))

# create plot
textplot_network(df_ELA_worst_lit_toks_smaller, vertex_size = size/ max(size) * 3)

Word Connections

Informational

Best Items
Code
### Word connections for best items

dim(df_ELA_best_inf_toks_smaller)
[1] 146  18
Code
# compute size weight for vertices in network
size <- log(colSums(df_ELA_best_inf_toks_smaller))

# create plot
textplot_network(df_ELA_best_inf_toks_smaller, vertex_size = size/ max(size) * 3)

Worst Items
Code
### Word connections for worst items

dim(df_ELA_worst_inf_toks_smaller)
[1] 63 22
Code
# compute size weight for vertices in network
size <- log(colSums(df_ELA_worst_inf_toks_smaller))

# create plot
textplot_network(df_ELA_worst_inf_toks_smaller, vertex_size = size/ max(size) * 3)

Dictionary to Create New Features

Dictionary: Key ELA Terms:

Code
my_ela_dict <- dictionary(list(poetry=c("poem","stanza", "lyric", "verse")))

my_ela_dict
Dictionary object with 1 key entry.
- [poetry]:
  - poem, stanza, lyric, verse
Code
# patterns_alg=c("equivalent expression","radical expression", "rational expression", #"expression", "equivalent", "distributive", "distribute", "factor")
Code
ELA_Toks1_mydict <- df_ELA_toks1 %>%
  dfm_lookup(my_ela_dict)

tail(ELA_Toks1_mydict, 10)
Document-feature matrix of: 10 documents, 1 feature (100.00% sparse) and 13 docvars.
         features
docs      poetry
  text684      0
  text685      0
  text686      0
  text687      0
  text688      0
  text689      0
[ reached max_ndoc ... 4 more documents ]
Code
df_ELADict <- convert(ELA_Toks1_mydict, to = "data.frame")



df_ELADict
Code
#Math_DF<-left_join(df_MathDict, Math_DF, by = "doc_id")
Code
# mathToksDFM_mydict <- df_Math_worst_tokens %>%
#   dfm_lookup(my_math_dict)
# 
# head(mathToksDFM_mydict, 10)
# 
# summary(df_Math_toks1)
Code
ELA_DF<-mutate(ELA_DF, ID = as.character(row_number()))

ELA_DF<-mutate(ELA_DF, doc_id = str_trim(paste("text", ID, sep="")))

ELA_DF<-left_join(df_ELADict, ELA_DF, by = "doc_id")

ELA_DF
Code
ELA_DF<-ELA_DF%>%
  mutate(poetry_factor = case_when(
    poetry == 0 ~ "no",
    poetry > 0 ~ "yes"
  ))
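
The same lookup-and-flag pattern scales to additional dictionary keys. A minimal sketch, assuming `df_ELA_toks1` from above; the `synthesis` key and its terms are illustrative placeholders, not terms mined from the items:

```r
# Sketch only: extend the dictionary with a second, hypothetical key.
# The "synthesis" terms are placeholders, not derived from the data.
my_ela_dict2 <- dictionary(list(
  poetry    = c("poem", "stanza", "lyric", "verse"),
  synthesis = c("both", "compare", "connection")
))

# Count dictionary hits per item, then flag items with any hit,
# mirroring the poetry_factor construction above
df_dict2 <- df_ELA_toks1 %>%
  dfm_lookup(my_ela_dict2) %>%
  convert(to = "data.frame") %>%
  mutate(synthesis_factor = if_else(synthesis > 0, "yes", "no"))
```

A flag built this way can be joined back into ELA_DF by doc_id, exactly as df_ELADict was, and offered to the models alongside poetry_factor.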

Machine Learning

Code
ELA_DF
Code
ELA_DF_Train_Full<-ELA_DF%>%
   mutate(year = as.factor(year))%>%
#   mutate(`school_state_bool` = case_when(
#     `school_state_diff` <= 0 ~ "negative",
#     `school_state_diff` > 0 ~ "positive")
#   )%>%
  filter(grade_level == 10| grade_level ==8)%>%
  # filter(grade_level == 10 | grade_level == 8| grade_level == 7| grade_level == 6|grade_level == 5)%>%
  mutate(id = row_number())%>%
   select(`id`, `standard`, `reporting category`, `Cluster`, `text_type`, `year`, `state%`, `school_state_diff`, `grade_level`, `poetry_factor`)
 ELA_DF_Train_Full

Create Training Set

Code
# set seed
set.seed(12345)

N<-nrow(ELA_DF_Train_Full)

trainIndex <- sample(1:N, .8*N)

testIndex<-c(1:N)[-trainIndex]

# check length of training set
length(trainIndex)
[1] 222
Code
trainIndexDF<-trainIndex

#Math_DF_Test<-Math_DF_Train_Full[-Math_DF_Train]


trainIndexDF<-as.data.frame(t(trainIndex))
dim(trainIndexDF)
[1]   1 222
Code
#dim(Math_DF_Test)

#Create training vector

trainIndexDF
Code
trainIndexDF<-trainIndexDF%>%
  pivot_longer(cols = starts_with("v"), names_to = "value_id", values_to = "id")
#trainIndexDF<-mutate(trainIndexDF, doc_id = str_trim(paste("text", ID, #sep="")))%>%
#  select("doc_id")

trainIndexDF
Code
ELA_DF_Train_Full
Code
ELA_DF_Train<-left_join(trainIndexDF, ELA_DF_Train_Full, by = "id")

ELA_DF_Train

Create Test Set

Code
testIndexDF<-testIndex

#ELA_DF_Test<-ELA_DF_Test_Full[-ELA_DF_Test]


testIndexDF<-as.data.frame(t(testIndex))
dim(testIndexDF)
[1]  1 56
Code
#dim(ELA_DF_Test)

#Create testing vector

#testIndexDF

testIndexDF<-testIndexDF%>%
  pivot_longer(cols = starts_with("v"), names_to = "value_id", values_to = "id")
#testIndexDF<-mutate(testIndexDF, doc_id = str_trim(paste("text", ID, sep="")))%>%
#  select("doc_id")

#ELA_DF_Test_Full

ELA_DF_Test<-left_join(testIndexDF, ELA_DF_Train_Full, by = "id")
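
Since id was defined as the row number of ELA_DF_Train_Full, the transpose / pivot_longer / left_join sequence above can be collapsed into direct row-index subsetting; a shorter equivalent sketch using the same index vectors:

```r
# Sketch: the same train/test split via row-index subsetting.
# trainIndex and testIndex are the sampled index vectors from above.
ELA_DF_Train <- ELA_DF_Train_Full[trainIndex, ]
ELA_DF_Test  <- ELA_DF_Train_Full[testIndex, ]
```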

Random Forest Model

Base Regression

Code
ELA_DF_Train_X<-ELA_DF_Train%>%
  select( `standard`, `reporting category`, `Cluster`, `text_type`, `year`, `state%`, `grade_level`)

ELA_DF_Train_Y<-ELA_DF_Train%>%
  select(`school_state_diff`)

ELA_DF_Train_Y
Code
ELA_DF_Test_X<-ELA_DF_Test%>%
select( `standard`, `reporting category`, `Cluster`, `text_type`, `year`, `state%`, `grade_level`)


ELA_DF_Test_Y<-ELA_DF_Test%>%
  select(`school_state_diff`)


ELA_DF_Test_Y
Code
ELA_DF_Train_Y
Code
ELA_DF_Train_X
Code
ELA_DF_Test_X
Code
set.seed(444)

ELA_diff_RF<-randomForest(ELA_DF_Train_X,
                      y = ELA_DF_Train$school_state_diff,
                      xtest = ELA_DF_Test_X,
                      ytest = ELA_DF_Test$school_state_diff,
                      mtry = 1,
                      importance = TRUE,
                      type="regression",
                      ntree = 1000)
Code
ELA_diff_RF

Call:
 randomForest(x = ELA_DF_Train_X, y = ELA_DF_Train$school_state_diff,      xtest = ELA_DF_Test_X, ytest = ELA_DF_Test$school_state_diff,      ntree = 1000, mtry = 1, importance = TRUE, type = "regression") 
               Type of random forest: regression
                     Number of trees: 1000
No. of variables tried at each split: 1

          Mean of squared residuals: 30.73875
                    % Var explained: 1.22
                       Test set MSE: 40.24
                    % Var explained: -9.23
Code
#predicted<-predict(ELA_diff_RF, ELA_DF_Test_X)
#mae(predicted, ELA_DF_Test$school_state_diff)

#actual <- ELA_DF_Test$school_state_diff




#R2 <- 1 - (sum((actual-predicted)^2)/sum((actual-mean(actual))^2))

#R2

# mean(abs(actual-predicted))
# 
# mean(abs(actual))
# sd(abs(actual))
# sd(abs(actual-predicted))
Code
varImpPlot(ELA_diff_RF)

Regression w/text

Code
ELA_DF_Train_X<-ELA_DF_Train%>%
  select( `standard`, `reporting category`, `Cluster`, `text_type`, `year`, `state%`, `grade_level`, poetry_factor)

ELA_DF_Train_Y<-ELA_DF_Train%>%
  select(`school_state_diff`)

ELA_DF_Train_Y
Code
ELA_DF_Test_X<-ELA_DF_Test%>%
select( `standard`, `reporting category`, `Cluster`, `text_type`, `year`, `state%`, `grade_level`, poetry_factor)


ELA_DF_Test_Y<-ELA_DF_Test%>%
  select(`school_state_diff`)


ELA_DF_Test_Y
Code
ELA_DF_Train_Y
Code
ELA_DF_Train_X
Code
ELA_DF_Test_X
Code
set.seed(444)

ELA_diff_RF_text<-randomForest(ELA_DF_Train_X,
                      y = ELA_DF_Train$school_state_diff,
                      xtest = ELA_DF_Test_X,
                      ytest = ELA_DF_Test$school_state_diff,
                      mtry = 1,
                      importance = TRUE,
                      type="regression",
                      ntree = 1000)
Code
ELA_diff_RF_text

Call:
 randomForest(x = ELA_DF_Train_X, y = ELA_DF_Train$school_state_diff,      xtest = ELA_DF_Test_X, ytest = ELA_DF_Test$school_state_diff,      ntree = 1000, mtry = 1, importance = TRUE, type = "regression") 
               Type of random forest: regression
                     Number of trees: 1000
No. of variables tried at each split: 1

          Mean of squared residuals: 30.73931
                    % Var explained: 1.21
                       Test set MSE: 39.64
                    % Var explained: -7.61
Code
#predicted<-predict(ELA_diff_RF, ELA_DF_Test_X)
#mae(predicted, ELA_DF_Test$school_state_diff)

#actual <- ELA_DF_Test$school_state_diff




#R2 <- 1 - (sum((actual-predicted)^2)/sum((actual-mean(actual))^2))

#R2

# mean(abs(actual-predicted))
# 
# mean(abs(actual))
# sd(abs(actual))
# sd(abs(actual-predicted))
Code
varImpPlot(ELA_diff_RF_text)

Support Vector Regression

Base SVM Model

Code
ELA_SVM_Train<-ELA_DF_Train%>%
  select(`standard`, `reporting category`, `Cluster`, `text_type`, `year`, `state%`, `grade_level`, school_state_diff)
#Math_DF_Train_X
#Math_DF_Train$school_state_diff

svm_ela_base<-svm(school_state_diff ~., data = ELA_SVM_Train, scale = FALSE, kernel = "linear")
                     
svm_ela_base

Call:
svm(formula = school_state_diff ~ ., data = ELA_SVM_Train, kernel = "linear", 
    scale = FALSE)


Parameters:
   SVM-Type:  eps-regression 
 SVM-Kernel:  linear 
       cost:  1 
      gamma:  0.03333333 
    epsilon:  0.1 


Number of Support Vectors:  217
Code
svm_ela_pred<-predict(svm_ela_base, ELA_DF_Test_X)

 mae(svm_ela_pred, ELA_DF_Test$school_state_diff)
[1] 4.650072
Code
 mean(abs(ELA_DF_Test$school_state_diff))
[1] 6.75
Code
  sd(abs(ELA_DF_Test$school_state_diff))
[1] 4.366192
Code
# 
Code
svmCoefs <- as.data.frame(t(coefficients(svm_ela_base)))

svmCoefs

SVM w/text feature Model

Code
ELA_SVM_Text_Train<-ELA_DF_Train%>%
  select(`standard`, `reporting category`, `Cluster`, `text_type`, `year`, `state%`, `grade_level`, school_state_diff, poetry_factor)
#Math_DF_Train_X
#Math_DF_Train$school_state_diff

svm_ela_text<-svm(school_state_diff ~., data = ELA_SVM_Text_Train, scale = FALSE, kernel = "linear")
                     
svm_ela_text

Call:
svm(formula = school_state_diff ~ ., data = ELA_SVM_Text_Train, kernel = "linear", 
    scale = FALSE)


Parameters:
   SVM-Type:  eps-regression 
 SVM-Kernel:  linear 
       cost:  1 
      gamma:  0.03225806 
    epsilon:  0.1 


Number of Support Vectors:  219
Code
ELA_DF_Test
Code
svm_ela_pred<-predict(svm_ela_text, ELA_DF_Test_X)

 mae(svm_ela_pred, ELA_DF_Test$school_state_diff)
[1] 4.650072
Code
 mean(abs(ELA_DF_Test$school_state_diff))
[1] 6.75
Code
  sd(abs(ELA_DF_Test$school_state_diff))
[1] 4.366192
Code
# 
Code
svmCoefs <- as.data.frame(t(coefficients(svm_ela_text)))

svmCoefs

Questions



  1. I fit a random forest model that included features extracted from my dictionaries to predict school_state_diff. My plan is to also fit a random forest model without the dictionary-extracted features to see whether the RMSE differs. Are there other things I should be looking at to evaluate the fit of these models?

  2. I tried an SVM regression model; however, because there are so many different standards, I received error messages when my test set contained values of the variable standard that weren't present in the training set. Is there a way to work around this? This was an issue for the Science and Math tests, but not for the ELA test, since the ELA standards are the same across all the grades. I can use a random forest for all of the tests without issue.

  3. I would like to do cross-validation and average the RMSE across folds to get a better sense of how the model is performing. Will we see any examples of this in the R tutorials? I have done it in Python but not in R.

  4. Once I work out these details of fitting and evaluating the models, I would like to follow the same process of creating dictionaries, extracting new features, and then fitting and evaluating predictive models for the Science and ELA exams. Is there anything else you think I should be incorporating in my process?
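
On question 1: alongside RMSE, MAE and test-set R² help put the two random forests on a common scale. A minimal sketch, assuming the fits above; when xtest/ytest are supplied, randomForest keeps the held-out predictions in $test$predicted:

```r
# Sketch: side-by-side test-set metrics for the two random forest fits above.
rmse   <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae_fn <- function(actual, predicted) mean(abs(actual - predicted))
r2     <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

actual    <- ELA_DF_Test$school_state_diff
pred_base <- ELA_diff_RF$test$predicted        # held-out predictions, base model
pred_text <- ELA_diff_RF_text$test$predicted   # held-out predictions, w/ poetry_factor

data.frame(
  model = c("base", "with poetry_factor"),
  RMSE  = c(rmse(actual, pred_base),   rmse(actual, pred_text)),
  MAE   = c(mae_fn(actual, pred_base), mae_fn(actual, pred_text)),
  R2    = c(r2(actual, pred_base),     r2(actual, pred_text))
)
```

Comparing MAE against mean(abs(actual)), as done for the SVM above, gives a naive baseline to beat.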
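
On question 2: the "new factor level" error can be worked around either by fixing the factor's level set on the full data before splitting, or by dropping unseen test rows. A sketch using the ELA frame names; the same pattern applies to the analogous Math and Science frames:

```r
# Sketch: two workarounds for test-set factor levels unseen in training.

# (a) Set the level set on the *full* data before splitting, so train and
#     test share identical levels even when a level has no training rows.
ELA_DF_Train_Full$standard <- factor(ELA_DF_Train_Full$standard)

# (b) Or drop test rows whose standard never appears in the training set
#     (loses a few rows, but predict() no longer errors).
seen <- unique(ELA_DF_Train$standard)
ELA_DF_Test_seen <- ELA_DF_Test[ELA_DF_Test$standard %in% seen, ]
```

A third option is to back off to a coarser grouping of standard, such as its reporting category, which is already a feature.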
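
On question 3: k-fold cross-validation is a short loop in base R (caret::train automates the same thing). A sketch assuming ELA_DF_Train_Full from above; the id column is dropped before fitting:

```r
# Sketch: 5-fold cross-validated RMSE for the random forest, in base R.
set.seed(12345)
k <- 5
cv_data <- subset(ELA_DF_Train_Full, select = -id)

# Convert character columns to factors on the full data, so every fold
# can predict on categories absent from its own training rows.
cv_data[] <- lapply(cv_data, function(x) if (is.character(x)) factor(x) else x)

folds <- sample(rep(1:k, length.out = nrow(cv_data)))  # random fold labels

rmse_per_fold <- sapply(1:k, function(i) {
  fit  <- randomForest(school_state_diff ~ ., data = cv_data[folds != i, ],
                       mtry = 1, ntree = 1000)
  pred <- predict(fit, cv_data[folds == i, ])
  sqrt(mean((cv_data$school_state_diff[folds == i] - pred)^2))
})

mean(rmse_per_fold)  # average RMSE across the 5 folds
```

sd(rmse_per_fold) alongside the mean gives a rough sense of how stable the estimate is.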